
Can I be smarter about late policies?

Questions: Is my late policy reasonable? Are there diversity implications for smart late policies?

Robin DeRosa had an interesting tweet about late policies recently, and I posted my late policy in reply. Here’s a slightly expanded version:

In most of my classes, late work happens because students are really busy, not because they’re slackers. That means a late policy with percentage deductions kind of sucks, because my students will also be really busy the next week. Instead, I combine “no late work accepted” with dropping the equivalent of one week’s worth of each assignment type. E.g. in a class that meets three times a week, I throw out three of the daily assignments.

I make sure to frame this in a discussion with the students, where I explain that the policy is an explicit recognition of the fact that they’re busy. If you’re too busy to get the work done on time, JUST SKIP it, and get your life caught up.

So far, it has been working out really well. The students appreciate the extra lever for managing their schedules, and it’s clear from the beginning that there won’t need to be any exceptions. (Note: every semester so far, students have managed to get confused about the policy early on. Luckily, this comes up on one of those low-weight daily assignments, so we clear it up before a high-stakes situation shows up.)

I know this won’t work well for everyone, especially people whose course materials are quite different from mine. I’ve been able to use it well in both intro and upper-level classes, though. It has been easy to adapt to group work as well: groups always get to evaluate each other as part of the process, so they’re in charge of some of the enforcement.

I know other people who have similar policies, but with more bookkeeping. For example, some people hand out “extensions” or “free passes” at the beginning of the term. That rewards organization, etc., but I don’t think it serves my goals as well. For one thing, you have to have a good bookkeeping system; if you hand out literal extension passes, you have to either make sure students aren’t sharing them with each other, or be OK with that economy. For another thing, I like the raw simplicity of my policy, and I think that it might help in terms of normalizing things between students. I could be wrong!

After that twitter thread, I have some questions:

Are there diversity implications for smart late policies? One of the obvious things I know about supporting diverse groups of students is that you need to recognize the fact that there’s a whole world out there affecting in-class and out-of-class performance. One of my hopes with my policy is that it’s explicitly good in this regard, but I’d love feedback here.

Should I add tiered deadlines? Matthew Cheney has a nice policy, where he distinguishes between “hard” and “soft” deadlines, explicitly telling the students which is which, and making that part of the late policy for grading. That seems honest and good, but I like the raw simplicity of my approach, so I’m nervous about adding more to it.

How does this affect colleagues? My department is young, so maybe this is a particular issue for us, but some of us tend to get more pushback from students than others (instead of young, you could probably insert all sorts of gender dynamics, etc. here). In several of our classes, we standardized on my above policy as the departmental policy. Being able to refer to it as The Departmental Policy seems to help those colleagues out.

Comments from Old Blog

3 Responses to Can I be smarter about late policies?

  1. Kat Bartlow says:

    Interesting take. In my classes, I usually have one set of low-stakes assignments (homework, quizzes on Moodle, etc.) My policy on this is to not accept late work, but to only require students to turn in 10 of 15 possible assignments. On the other hand, for major assignments like papers, I do deduct percentage points per day of lateness unless the student makes a case for an extension before the paper deadline. I think there's value in recognizing student busy-ness and not driving them to extremes (particularly on your pressure-cooker campus), but there's also value in reminding them that in order to achieve your goals, you have to set priorities and even (gasp!) work to deadlines.


Optimizing MSD calculations


This whole post is available as an IPython Notebook here.

Writing a faster mean-squared-displacement function

I'm trying to calculate the Mean Squared Displacement (MSD) of a particle trajectory. In reality, I have an array of positions for numtrj particles and numpts timepoints and dim dimensions: pos[numtrj, numpts, dim]. I think my question has the same answer if I just have the x trajectory of a single particle, though.

In case you haven't done MSD calculations before, there's one cute way in which you get extra information out of a trajectory. Say I have just five time points, and my positions are

In [83]: x

Out[83]: array([ 0.        ,  1.74528704,  1.59639865,  2.59976219,  3.70852457])

You could just get squared displacement by looking at x**2. However, you could also say that you have 4 different values for the displacement at one timestep:

x[1] - x[0], x[2] - x[1], x[3] - x[2], x[4] - x[3]

Similarly, three values for displacement at two timesteps: x[2:] - x[:-2]. Etc. So, the way I'm calculating MSD at the moment is:

def msd_1d(x):
    result = np.zeros_like(x)
    for i in range(1,len(x)):
        result[i] = np.average((x[i:] - x[:-i])**2)
    return result

or

def get_msd_traj(pos):
    result = np.zeros_like(pos)
    for i in range(1,pos.shape[1]):
        result[:,i,:] = np.average((pos[:,i:,:] - pos[:,:-i,:])**2,axis=1)
    return result

(side note: often the data comes indexed like pos[numpts, numtrj, dim] for molecular dynamics trajectories, but that doesn't change anything here)
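If your data does arrive as pos[numpts, numtrj, dim], a cheap axis swap gives the layout used in this post. np.swapaxes returns a view, not a copy, so this costs essentially nothing (the sizes below are just illustrative):

```python
import numpy as np

# MD-style layout: [numpts, numtrj, dim]
pos_md = np.zeros((2001, 512, 2))

# Swap to the [numtrj, numpts, dim] layout used in this post.
# This is a view: no data is copied.
pos = np.swapaxes(pos_md, 0, 1)
```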

So, I asked Joshua Adelman if he had any quick thoughts.

Josh cleared up some basic misconceptions for me and gave a couple of suggestions:

From Josh

Hi Michael,

In general, numba is not going to speed up calculations that involve numpy built-in methods (unless they are things like np.exp, np.sin, np.cos) or array slicing. It works best if you have unrolled any vectorization into explicit nested loops and operate on arrays element-by-element. If you go that route, check to see if you can get the code to compile using @jit(nopython=True), which would be indicative of numba being able to translate the code to llvm IR without using python objects (which tend to slow things down). I've also had a lot of success lately parallelizing numba code using threading if you aren't dealing with any python objects. If you can compile in nopython mode, then you can also specify:

@jit(nopython=True, nogil=True)

and use a ThreadPool to chop up the work, using something like:

http://numba.pydata.org/numba-doc/0.19.1/user/examples.html?highlight=nogil#multi-threading

it's undocumented, but for simple stuff you can use:

from multiprocessing.pool import ThreadPool
pool = ThreadPool(processes=nthreads)
results = pool.map(func, arg)

It has the same API as the multiprocessing pool, and for embarrassingly parallel tasks that don't have possible race conditions, it works quite nicely depending on the overhead involved in spawning off the task relative to the cost of the task.

Also, make sure you're using the latest version of numba since it's getting better rapidly. I'm not sure if you'll be able to call np.zeros_like in a numba-ized method in nopython mode, although you might. If not, you can pass in the results array as an argument.

Hope that helps. I can't think of any sneaky vectorization tricks in pure-numpy off the top of my head to make your calculation faster. Let me know if you have any other questions.

Josh
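As a concrete sketch of the ThreadPool suggestion, here's the plumbing with the plain numpy function, treating each 1-D trajectory as an independent task (the sizes and thread count are arbitrary):

```python
from multiprocessing.pool import ThreadPool

import numpy as np

def msd_1d(x):
    result = np.zeros_like(x)
    for i in range(1, len(x)):
        result[i] = np.average((x[i:] - x[:-i]) ** 2)
    return result

# Eight independent 1-D trajectories: an embarrassingly parallel workload.
pos = np.random.randn(8, 200)
pool = ThreadPool(processes=4)
results = pool.map(msd_1d, pos)  # one trajectory per task
pool.close()
pool.join()
```

With the pure-python version the GIL limits any actual speedup; the point of nogil=True is that the numba-compiled function can genuinely run these tasks in parallel.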

Back to me

So, I tried several things. You have to do some fiddling around to get nopython=True to work. In addition to numba not knowing about np.average, slicing expressions like x[1:] - x[:-1] are doomed to failure. However, that last part isn't really trouble, because one of Josh's points was that I should write out all of my loops explicitly. Here are two numba versions of the 1D function:

In [1]:
import numpy as np
from numba import jit
In [2]:
def msd_1d(x):
    result = np.zeros_like(x)
    for i in range(1,len(x)):
        result[i] = np.average((x[i:] - x[:-i])**2)
    return result

@jit(nopython = True)
def msd_1d_nb1(x):
    result = np.zeros_like(x)
    for delta in range(1,len(x)):
        thisresult = 0
        for i in range(delta,len(x)):
            thisresult += (x[i] - x[i-delta])**2
        result[delta] = thisresult / (len(x) - delta)
    return result
    
@jit(nopython = True)
def msd_1d_nb2(x):
    result = np.zeros_like(x)
    for delta in range(1,len(x)):
        for i in range(delta,len(x)):
            result[delta] += (x[i] - x[i-delta])**2
        result[delta] = result[delta] / (len(x) - delta)
    return result

Note that the definitions above will compile almost no matter what you write: numba compiles lazily, so you don't hit the numba errors until you actually call the functions. Here are some timings:

In [3]:
%timeit msd_1d(np.random.randn(10))
%timeit msd_1d_nb1(np.random.randn(10))
%timeit msd_1d_nb2(np.random.randn(10))
The slowest run took 4.14 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 114 µs per loop
The slowest run took 11972.39 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 7.02 µs per loop
The slowest run took 14872.73 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 6.94 µs per loop
In [4]:
%timeit msd_1d(np.random.randn(100))
%timeit msd_1d_nb1(np.random.randn(100))
%timeit msd_1d_nb2(np.random.randn(100))
1000 loops, best of 3: 1.19 ms per loop
100000 loops, best of 3: 15.8 µs per loop
100000 loops, best of 3: 18.5 µs per loop
In [5]:
%timeit msd_1d(np.random.randn(1000))
%timeit msd_1d_nb1(np.random.randn(1000))
%timeit msd_1d_nb2(np.random.randn(1000))
100 loops, best of 3: 13.3 ms per loop
1000 loops, best of 3: 540 µs per loop
1000 loops, best of 3: 594 µs per loop

So, pretty big win there! Approximately two orders of magnitude. It looks like it scales well too.

So, it looks like there's not much of a difference between versions 1 and 2, though version 1 does look a bit faster.

The larger version

In [6]:
def get_msd_traj(pos):
    result = np.zeros_like(pos)
    for i in range(1,pos.shape[1]):
        result[:,i,:] = np.average((pos[:,i:,:] - pos[:,:-i,:])**2,axis=1)
    return result

@jit(nopython = True)
def get_msd_traj_nb1(pos):
    result = np.zeros_like(pos)
    deltastop = pos.shape[1]
    for traj in range(pos.shape[0]):
        for dim in range(pos.shape[2]):
            for delta in range(1,deltastop):
                thisresult = 0
                for i in range(delta,deltastop):
                    thisresult += (pos[traj,i,dim] - pos[traj,i-delta,dim])**2
                result[traj,delta,dim] = thisresult / (deltastop - delta)
    return result

First, let's do some due diligence and make sure the results are equivalent. Then, timings.

In [7]:
a = np.random.randn(5,3,2)
np.all(get_msd_traj(a) == get_msd_traj_nb1(a))
Out[7]:
True
In [8]:
%timeit get_msd_traj(np.random.randn(10,10,2))
%timeit get_msd_traj_nb1(np.random.randn(10,10,2))
1000 loops, best of 3: 200 µs per loop
100000 loops, best of 3: 14.6 µs per loop
In [9]:
%timeit get_msd_traj(np.random.randn(100,100,2))
%timeit get_msd_traj_nb1(np.random.randn(100,100,2))
10 loops, best of 3: 12.3 ms per loop
1000 loops, best of 3: 1.84 ms per loop
In [10]:
%timeit get_msd_traj(np.random.randn(512,2001,2))
%timeit get_msd_traj_nb1(np.random.randn(512,2001,2))
1 loops, best of 3: 27.8 s per loop
1 loops, best of 3: 2.47 s per loop

So, for the actual data sizes I care about, I'm probably down to "only" an order of magnitude speedup. That's still pretty awesome.

Comments from Old Blog

One Response to Using numba to speed up mean-squared displacement calculations

  1. mglerner says:

    On Twitter, Konrad Hinsen pointed out that I could get a bigger speedup (likely a factor of 1000, but it's N log N instead of N², so it only gets better) by using a better algorithm. Several years ago, I did try to use an FFT-based method. I was sketched out by the fact that the results I obtained were similar to the exact results, but not identical. It sounds likely that his method is just better. The paper claims it's exact, and a followup conversation indicates that they get better accuracy by doing fewer calculations (thus less accumulated error). Filed away for next time!
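For the curious, here's a sketch of the FFT trick, using the standard decomposition MSD[m] = S1[m] - 2*S2[m], where S2 is an autocorrelation you can get from a zero-padded FFT and S1 falls out of a running sum. This is my reconstruction of the general technique, not the paper's code; it should agree with the brute-force loop to floating-point precision.

```python
import numpy as np

def autocorr_fft(x):
    """S2[m] = sum_i x[i]*x[i+m] / (N - m), via a zero-padded FFT."""
    N = len(x)
    F = np.fft.fft(x, n=2 * N)  # zero-pad to avoid circular wrap-around
    acf = np.fft.ifft(F * F.conjugate())[:N].real
    return acf / (N - np.arange(N))

def msd_fft_1d(x):
    """O(N log N) mean squared displacement for a 1-D trajectory."""
    N = len(x)
    D = np.append(x ** 2, 0.0)  # trailing 0 makes the D[-1] and D[N] terms vanish
    S2 = autocorr_fft(x)
    Q = 2.0 * D.sum()
    S1 = np.zeros(N)
    for m in range(N):
        Q -= D[m - 1] + D[N - m]
        S1[m] = Q / (N - m)
    return S1 - 2 * S2
```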


MMMC2015


This whole post is available as an IPython Notebook here.

MMMC the 2015 update

See the original blog post for details and history. Here's the short story: in my Statistical and Thermal Physics class, we want to use Monte Carlo simulations to generate brackets for March Madness. There are at least two obvious ways to go about this:

  1. Make some function that tells us the chance that team A beats team B, then flip coins for each matchup. That gets you one bracket. Repeat 100,000 times, collect statistics. This is the way Nate Silver's 538.com handles simulations for basketball, elections, etc, and I should probably implement it (note to self/motivated students: it's as easy as just generating 100,000 new brackets at a given temperature).

  2. Generate one bracket, then do a Monte-Carlo walk through bracket space. This is tougher. We have to figure out how to make a move in bracket space, which is part of the fun of Monte Carlo simulations in general. To see how this is done, check out the code in Bracket.swap and Brackets.simulate.
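A rough sketch of option 1 (the strength numbers and function names here are invented for illustration; this is not the MMMC API):

```python
import random

# Made-up strengths, for illustration only.
strength = {'Kentucky': 0.95, 'Duke': 0.90, 'Wisconsin': 0.88, 'Villanova': 0.85}

def win_prob(a, b):
    """Chance that team a beats team b (a stand-in for a real model)."""
    sa, sb = strength[a], strength[b]
    return sa / (sa + sb)

def play_round(teams, rng=random):
    """Flip one biased coin per matchup; winners advance."""
    return [a if rng.random() < win_prob(a, b) else b
            for a, b in zip(teams[::2], teams[1::2])]

def one_bracket(teams, rng=random):
    """One full single-elimination bracket; repeat ~100,000 times for stats."""
    rounds = [list(teams)]
    while len(rounds[-1]) > 1:
        rounds.append(play_round(rounds[-1], rng))
    return rounds
```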

As you can tell, we take option 2 above. I've made things a bit nicer from a user standpoint this year; here's a walkthrough. First, load up our standard IPython setup

In [1]:
from __future__ import division
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
import numpy as np
from matplotlib import pyplot as plt
from IPython.display import HTML

The basics

make me a single bracket at a given temperature

In [2]:
import MarchMadnessMonteCarlo as MMMC
teams = MMMC.teams['south']
b = MMMC.Bracket(teams=teams,T=0.5)
print b
Duke (1)                                                     
Robert Morris (16)        Duk (1)                            
San Diego St. (8)                                            
St. John's (9)            San (8)  Duk (1)                   
Utah (5)                                                     
Stephen F. Austin (12)    Ste (12)                           
Georgetown (4)                                               
Eastern Washington (13)   Geo (4)  Ste (12) Ste (12)         
SMU (6)                   UCL (11) UCL (11) UCL (11) Ste (12)
UCLA (11)                                                    
Iowa St. (3)              Iow (3)                            
UAB (14)                                                     
Iowa (7)                  Iow (7)  Iow (7)                   
Davidson (10)                                                
Gonzaga (2)               Gon (2)                            
North Dakota St. (15)                                        
Total bracket energy: -17.4533174222

Now, instead of asking for all of the teams in the South, make up a set of Final Four teams, run 1000 simulations, and show some statistics:

In [3]:
sr = MMMC.simulate(1000,['Kentucky','Wisconsin','Villanova','Duke'],0.5)
MMMC.showstats(sr,newfig=True)
Lowest energy bracket
Kentucky (1)                               
Wisconsin (1)             Ken (1)          
Villanova (1)             Vil (1)  Ken (1) 
Duke (1)                                   
Total bracket energy: -3.05919025906

Most common bracket (159)
Kentucky (1)                               
Wisconsin (1)             Ken (1)          
Villanova (1)             Duk (1)  Ken (1) 
Duke (1)                                   
Total bracket energy: -3.04122415391

As you can see, we fully sample bracket space pretty quickly (look at the graph in the top right, with the unique brackets shown).

What should our temperature be?

If we had chosen option 1 at the top, we'd just flip coins with a given probability of winning. Here (and this may be a questionable decision), we need to set an overall temperature for our simulation. Intuitively, the higher the temperature, the closer we come to a random outcome. The lower the temperature, the closer we come to a "best seed always wins" bracket. If we're going to make sense of temperature, we should pick a reasonable energy function.
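Concretely, T enters through the Metropolis acceptance rule: a proposed change to the bracket with energy change dE is accepted with probability min(1, exp(-dE/T)). Here's a minimal sketch of that rule; I'm assuming it mirrors what Bracket.swap and Brackets.simulate do internally, and it is not the actual MMMC code:

```python
import math
import random

def metropolis_accept(dE, T, rng=random):
    """Accept a proposed move with probability min(1, exp(-dE / T))."""
    return dE <= 0 or rng.random() < math.exp(-dE / T)
```

At high T almost every move is accepted (random outcomes); at low T only energy-lowering moves survive ("best seed always wins").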

We can use KenPom's log5 defined as

def log5_energy_game(winner, loser):
    A,B = strength[winner],strength[loser]
    # see http://207.56.97.150/articles/playoff2002.htm
    win_pct = (A-A*B)/(A+B-2*A*B)
    return -win_pct

Conveniently, that's coded up for you in MarchMadnessMonteCarlo.examples.

Later on, we could make a fancy energy function with, e.g., a weighted average of KenPom, Jeff Sagarin, and the NCAA rankings.
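That blended energy might look something like the sketch below; the rating dicts and weights are made up for illustration, not real data.

```python
def log5(A, B):
    """KenPom-style log5 win probability for strength A over strength B."""
    return (A - A * B) / (A + B - 2 * A * B)

def blended_energy_game(winner, loser, systems, weights):
    """Energy = minus a weighted average of win probabilities.

    systems: list of {team: strength} dicts; weights: matching floats.
    """
    total = sum(weights)
    win_pct = sum(w * log5(s[winner], s[loser])
                  for s, w in zip(systems, weights)) / total
    return -win_pct

# Illustrative (fake) numbers:
kenpom = {'Kentucky': 0.96, 'Kansas': 0.90}
sagarin = {'Kentucky': 0.94, 'Kansas': 0.91}
energy = blended_energy_game('Kentucky', 'Kansas', [kenpom, sagarin], [0.7, 0.3])
```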

In [4]:
import MarchMadnessMonteCarlo.examples
MMMC.set_energy_function(MarchMadnessMonteCarlo.examples.log5_energy_game)

Now, what should our actual temperature be? Historically, we know that an 8 seed vs. a 9 seed should essentially be a tossup. So, as a proxy here, we could just look at the chance of an 8 seed winning over a range of temperatures, and pick the point where it's pretty close to 0.5.

In [5]:
def winpct8(team8,team9,T,numtrials=10000):
    results = [MMMC.playgame(team8,team9,T)[0] == team8 for i in range(numtrials)]
    return np.average(results)
def plotwins(team8,team9,numtrials=10000):
    Ts = np.linspace(0,3,100)
    pct = [winpct8(team8,team9,T,numtrials) for T in Ts]
    plt.plot(Ts,pct,label='{t1} vs. {t2}'.format(t1=team8,t2=team9))
    plt.xlabel('T')
    plt.ylabel('winpct')

We'll look at it for all four of the 8 vs. 9 matchups

In [6]:
plotwins('Cincinnati','Purdue')
plotwins('Oregon','Oklahoma St.')
plotwins('North Carolina St.','LSU')
plotwins('San Diego St.',"St. John's")
plt.legend()
Out[6]:
<matplotlib.legend.Legend at 0x7fc7e94c7650>

A couple of things jump out. First, the green line: KenPom has Oregon ranked below Oklahoma State, so the curve approaches 50% from below instead of from above.

Second, just eyeballing that, it looks like we could make a legitimate argument for almost anything between T=1.0 and 2.0.

What does that do to "clear" matchups? Well, Kentucky is a big favorite over Kansas with this model:

In [7]:
MarchMadnessMonteCarlo.examples.log5_energy_game('Kentucky','Kansas')
Out[7]:
-0.8176307825952608
In [8]:
plotwins('Kentucky','Kansas')

If I pick T=1.5, I get Kentucky, the favorite, winning 60% of the time. You can feel free to choose a "saner" temperature, but I'm rooting for KU here!

Running some brackets!

So what does a bracket happen to look like? Well, in the original blog post, I mentioned that we came up with two different ways to run brackets. It can take a while to run a bracket, so a fast way (implemented as runbracket2) is to run separate brackets for each of the individual regions, then take the winner from each region's most common bracket to form a Final Four. Below, we do just that, running 10,000 trials for each of the regions, and 1000 trials for the Final Four.

In [9]:
results = MMMC.runbracket2(ntrials1=10000,ntrials2=1000,T=1.5)
YOUR LOWEST ENERGY BRACKETS
LOWEST ENERGY BRACKET FOR REGION midwest
Kentucky (1)                                                 
Hampton (16)              Ken (1)                            
Cincinnati (8)                                               
Purdue (9)                Cin (8)  Ken (1)                   
West Virginia (5)                                            
Buffalo (12)              Wes (5)                            
Maryland (4)                                                 
Valparaiso (13)           Mar (4)  Wes (5)  Ken (1)          
Butler (6)                Tex (11) Not (3)  Not (3)  Ken (1) 
Texas (11)                                                   
Notre Dame (3)            Not (3)                            
Northeastern (14)                                            
Wichita St. (7)           Wic (7)  Kan (2)                   
Indiana (10)                                                 
Kansas (2)                Kan (2)                            
New Mexico St. (15)                                          
Total bracket energy: -10.5524230049


LOWEST ENERGY BRACKET FOR REGION west
Wisconsin (1)                                                
Coastal Carolina (16)     Wis (1)                            
Oregon (8)                                                   
Oklahoma St. (9)          Okl (9)  Wis (1)                   
Arkansas (5)                                                 
Wofford (12)              Ark (5)                            
North Carolina (4)                                           
Harvard (13)              Nor (4)  Nor (4)  Wis (1)          
Xavier (6)                Xav (6)  Bay (3)  Ari (2)  Ari (2) 
Mississippi (11)                                             
Baylor (3)                Bay (3)                            
Georgia St. (14)                                             
VCU (7)                   Ohi (10) Ari (2)                   
Ohio St. (10)                                                
Arizona (2)               Ari (2)                            
Texas Southern (15)                                          
Total bracket energy: -10.8592988759


LOWEST ENERGY BRACKET FOR REGION south
Duke (1)                                                     
Robert Morris (16)        Duk (1)                            
San Diego St. (8)                                            
St. John's (9)            San (8)  Duk (1)                   
Utah (5)                                                     
Stephen F. Austin (12)    Uta (5)                            
Georgetown (4)                                               
Eastern Washington (13)   Geo (4)  Uta (5)  Duk (1)          
SMU (6)                   SMU (6)  Iow (3)  Gon (2)  Gon (2) 
UCLA (11)                                                    
Iowa St. (3)              Iow (3)                            
UAB (14)                                                     
Iowa (7)                  Iow (7)  Gon (2)                   
Davidson (10)                                                
Gonzaga (2)               Gon (2)                            
North Dakota St. (15)                                        
Total bracket energy: -10.5015086048


LOWEST ENERGY BRACKET FOR REGION east
Villanova (1)                                                
Lafayette (16)            Vil (1)                            
North Carolina St. (8)                                       
LSU (9)                   Nor (8)  Vil (1)                   
Northern Iowa (5)                                            
Wyoming (12)              Nor (5)                            
Louisville (4)                                               
UC Irvine (13)            Lou (4)  Nor (5)  Vil (1)          
Providence (6)            Pro (6)  Okl (3)  Vir (2)  Vir (2) 
Dayton (11)                                                  
Oklahoma (3)              Okl (3)                            
Albany (14)                                                  
Michigan St. (7)          Mic (7)  Vir (2)                   
Georgia (10)                                                 
Virginia (2)              Vir (2)                            
Belmont (15)                                                 
Total bracket energy: -10.8839299775


LOWEST ENERGY BRACKET FOR FINAL FOUR
Kentucky (1)                               
Wisconsin (1)             Ken (1)          
Utah (5)                  Vir (2)  Ken (1) 
Virginia (2)                               
Total bracket energy: -1.95692907302

YOUR MOST COMMON BRACKETS
MOST COMMON BRACKET FOR REGION midwest
Kentucky (1)                                                 
Hampton (16)              Ken (1)                            
Cincinnati (8)                                               
Purdue (9)                Pur (9)  Ken (1)                   
West Virginia (5)                                            
Buffalo (12)              Wes (5)                            
Maryland (4)                                                 
Valparaiso (13)           Mar (4)  Mar (4)  Ken (1)          
Butler (6)                Tex (11) Not (3)  Kan (2)  Ken (1) 
Texas (11)                                                   
Notre Dame (3)            Not (3)                            
Northeastern (14)                                            
Wichita St. (7)           Wic (7)  Kan (2)                   
Indiana (10)                                                 
Kansas (2)                Kan (2)                            
New Mexico St. (15)                                          
Total bracket energy: -10.3577160931

number of times this bracket happened: 11


MOST COMMON BRACKET FOR REGION west
Wisconsin (1)                                                
Coastal Carolina (16)     Wis (1)                            
Oregon (8)                                                   
Oklahoma St. (9)          Ore (8)  Wis (1)                   
Arkansas (5)                                                 
Wofford (12)              Ark (5)                            
North Carolina (4)                                           
Harvard (13)              Nor (4)  Nor (4)  Wis (1)          
Xavier (6)                Mis (11) Mis (11) Ohi (10) Wis (1) 
Mississippi (11)                                             
Baylor (3)                Geo (14)                           
Georgia St. (14)                                             
VCU (7)                   Ohi (10) Ohi (10)                  
Ohio St. (10)                                                
Arizona (2)               Tex (15)                           
Texas Southern (15)                                          
Total bracket energy: -9.30614882152

number of times this bracket happened: 13


MOST COMMON BRACKET FOR REGION south
Duke (1)                                                     
Robert Morris (16)        Duk (1)                            
San Diego St. (8)                                            
St. John's (9)            St. (9)  Duk (1)                   
Utah (5)                                                     
Stephen F. Austin (12)    Uta (5)                            
Georgetown (4)                                               
Eastern Washington (13)   Geo (4)  Uta (5)  Uta (5)          
SMU (6)                   UCL (11) UCL (11) Gon (2)  Uta (5) 
UCLA (11)                                                    
Iowa St. (3)              Iow (3)                            
UAB (14)                                                     
Iowa (7)                  Dav (10) Gon (2)                   
Davidson (10)                                                
Gonzaga (2)               Gon (2)                            
North Dakota St. (15)                                        
Total bracket energy: -9.78926932204

number of times this bracket happened: 9


MOST COMMON BRACKET FOR REGION east
Villanova (1)                                                
Lafayette (16)            Vil (1)                            
North Carolina St. (8)                                       
LSU (9)                   LSU (9)  Vil (1)                   
Northern Iowa (5)                                            
Wyoming (12)              Nor (5)                            
Louisville (4)                                               
UC Irvine (13)            Lou (4)  Lou (4)  Vil (1)          
Providence (6)            Day (11) Okl (3)  Vir (2)  Vir (2) 
Dayton (11)                                                  
Oklahoma (3)              Okl (3)                            
Albany (14)                                                  
Michigan St. (7)          Geo (10) Vir (2)                   
Georgia (10)                                                 
Virginia (2)              Vir (2)                            
Belmont (15)                                                 
Total bracket energy: -10.5204279257

number of times this bracket happened: 11


MOST COMMON BRACKET FOR FINAL FOUR
Kentucky (1)                               
Wisconsin (1)             Ken (1)          
Utah (5)                  Uta (5)  Ken (1) 
Virginia (2)                               
Total bracket energy: -1.78538460124

number of times this bracket happened: 176

So, not looking so hot for my Kansas, but the code is working.

Let's visualize the results, then make a table of results, as per 538.

In [10]:
# Basic visualization
for region in 'midwest west east south'.split():
    MMMC.showstats(results[region],description=region,newfig=True)

Here you can see the classic reason to do Monte Carlo sampling in the first place: we're absolutely nowhere near fully sampling bracket space. You can tell this from each of the "Unique brackets over the trajectory" plots, none of which have leveled out.

Further, you can see that the strength distribution in each of the regions is pretty different. Check out the energy distribution histograms for visual proof.

Nice tables

And here's our set of tabulated results. Note that things look a little funny for the Final Four: runbracket2 runs the Final Four separately from the rest of the tourney. So, only four teams are allowed to have Championship and Win percentages above zero (the Final Four percentages are taken from the first chunk of the tourney).

In [11]:
h = HTML(MMMC.maketable(results))
h
Out[11]:
Team Region Rank 2nd Round 3rd Round Sweet 16 Elite 8 Final 4 Championship Win
Kentucky midwest 1 100.0 65.74 49.18 36.16 25.29 55.3 34.5
Wisconsin west 1 100.0 63.38 44.28 30.04 19.44 44.7 24.2
Virginia east 2 100.0 64.6 41.15 27.68 17.86 51.3 22.4
Utah south 5 100.0 58.29 35.86 20.96 12.19 48.7 18.9
Arizona west 2 100.0 62.35 42.23 28.64 18.74 0.0 0.0
Villanova east 1 100.0 65.73 44.05 29.84 18.61 0.0 0.0
Duke south 1 100.0 66.03 41.64 25.2 15.05 0.0 0.0
Gonzaga south 2 100.0 61.67 37.12 23.51 14.22 0.0 0.0
Kansas midwest 2 100.0 58.6 33.71 19.16 10.31 0.0 0.0
Oklahoma east 3 100.0 60.74 34.96 18.34 10.01 0.0 0.0
Northern Iowa east 5 100.0 65.23 37.58 19.72 9.97 0.0 0.0
Notre Dame midwest 3 100.0 57.37 32.47 18.57 9.65 0.0 0.0
Iowa St. south 3 100.0 57.01 33.22 17.32 9.07 0.0 0.0
Baylor west 3 100.0 56.3 33.23 16.77 8.79 0.0 0.0
North Carolina west 4 100.0 61.15 35.63 17.73 8.7 0.0 0.0
SMU south 6 100.0 53.2 28.16 14.61 7.62 0.0 0.0
Wichita St. midwest 7 100.0 51.59 28.31 15.41 7.54 0.0 0.0
Georgetown south 4 100.0 60.99 28.74 15.08 7.37 0.0 0.0
Michigan St. east 7 100.0 55.17 25.73 14.43 7.17 0.0 0.0
Iowa south 7 100.0 51.15 25.99 13.76 6.63 0.0 0.0
Louisville east 4 100.0 56.62 29.3 14.05 6.58 0.0 0.0
West Virginia midwest 5 100.0 55.77 31.54 13.3 6.47 0.0 0.0
Providence east 6 100.0 53.03 27.3 12.58 6.19 0.0 0.0
Butler midwest 6 100.0 53.61 27.03 13.43 6.05 0.0 0.0
Texas midwest 11 100.0 46.39 23.9 12.46 5.99 0.0 0.0
Xavier west 6 100.0 50.71 25.68 12.46 5.95 0.0 0.0
Ohio St. west 10 100.0 51.62 22.26 11.19 5.29 0.0 0.0
Arkansas west 5 100.0 57.67 28.22 11.83 5.14 0.0 0.0
Maryland midwest 4 100.0 55.25 25.84 11.73 5.14 0.0 0.0
VCU west 7 100.0 48.38 21.78 11.24 5.06 0.0 0.0
Oklahoma St. west 9 100.0 50.19 21.58 11.31 4.84 0.0 0.0
San Diego St. south 8 100.0 47.92 23.04 10.71 4.81 0.0 0.0
Oregon west 8 100.0 49.81 20.61 10.42 4.73 0.0 0.0
Cincinnati midwest 8 100.0 49.74 19.41 9.93 4.72 0.0 0.0
Davidson south 10 100.0 48.85 23.46 10.7 4.54 0.0 0.0
LSU east 9 100.0 53.31 24.25 11.58 4.53 0.0 0.0
North Carolina St. east 8 100.0 46.69 21.93 10.18 4.35 0.0 0.0
Stephen F. Austin south 12 100.0 41.71 21.31 9.08 4.3 0.0 0.0
UCLA south 11 100.0 46.8 21.71 9.51 4.22 0.0 0.0
St. John's south 9 100.0 52.08 22.83 9.84 4.17 0.0 0.0
Purdue midwest 9 100.0 50.26 19.41 8.86 3.76 0.0 0.0
Buffalo midwest 12 100.0 44.23 23.13 8.76 3.55 0.0 0.0
Georgia east 10 100.0 44.83 19.41 8.76 3.53 0.0 0.0
Dayton east 11 100.0 46.97 22.59 8.55 3.52 0.0 0.0
Mississippi west 11 100.0 49.29 21.45 9.01 3.51 0.0 0.0
Indiana midwest 10 100.0 48.41 22.29 8.57 3.14 0.0 0.0
Valparaiso midwest 13 100.0 44.75 19.49 7.86 3.1 0.0 0.0
Georgia St. west 14 100.0 43.7 19.64 6.96 2.67 0.0 0.0
Harvard west 13 100.0 38.85 18.65 6.98 2.36 0.0 0.0
Wofford west 12 100.0 42.33 17.5 6.84 2.33 0.0 0.0
New Mexico St. midwest 15 100.0 41.4 15.69 6.3 2.26 0.0 0.0
UC Irvine east 13 100.0 43.38 19.57 6.64 2.17 0.0 0.0
Northeastern midwest 14 100.0 42.63 16.6 6.1 2.17 0.0 0.0
UAB south 14 100.0 42.99 16.91 6.09 2.02 0.0 0.0
Wyoming east 12 100.0 34.77 13.55 5.06 1.8 0.0 0.0
Albany east 14 100.0 39.26 15.15 4.86 1.59 0.0 0.0
Eastern Washington south 13 100.0 39.01 14.09 5.06 1.42 0.0 0.0
Coastal Carolina west 16 100.0 36.62 13.53 4.85 1.42 0.0 0.0
North Dakota St. south 15 100.0 38.33 13.43 4.5 1.35 0.0 0.0
Belmont east 15 100.0 35.4 13.71 4.8 1.28 0.0 0.0
Texas Southern west 15 100.0 37.65 13.73 3.73 1.03 0.0 0.0
Robert Morris south 16 100.0 33.97 12.49 4.07 1.02 0.0 0.0
Hampton midwest 16 100.0 34.26 12.0 3.4 0.86 0.0 0.0
Lafayette east 16 100.0 34.27 9.77 2.93 0.84 0.0 0.0

Given those concerns, let's just run runbracket1 and look at the results.

In [12]:
results = MMMC.runbracket1(ntrials=10000,T=1.5)
Lowest energy bracket
Kentucky (1)                                                                   
Hampton (16)              Ken (1)                                              
Cincinnati (8)                                                                 
Purdue (9)                Cin (8)  Ken (1)                                     
West Virginia (5)                                                              
Buffalo (12)              Wes (5)                                              
Maryland (4)                                                                   
Valparaiso (13)           Mar (4)  Wes (5)  Ken (1)                            
Butler (6)                                                                     
Texas (11)                Tex (11)                                             
Notre Dame (3)                                                                 
Northeastern (14)         Not (3)  Not (3)                                     
Wichita St. (7)                                                                
Indiana (10)              Wic (7)                                              
Kansas (2)                                                                     
New Mexico St. (15)       Kan (2)  Kan (2)  Not (3)  Ken (1)                   
Duke (1)                                                                       
Robert Morris (16)        Duk (1)                                              
San Diego St. (8)                                                              
St. John's (9)            San (8)  Duk (1)                                     
Utah (5)                                                                       
Stephen F. Austin (12)    Uta (5)                                              
Georgetown (4)                                                                 
Eastern Washington (13)   Geo (4)  Uta (5)  Duk (1)                            
SMU (6)                                                                        
UCLA (11)                 SMU (6)                                              
Iowa St. (3)                                                                   
UAB (14)                  Iow (3)  Iow (3)                                     
Iowa (7)                                                                       
Davidson (10)             Iow (7)                                              
Gonzaga (2)                                                                    
North Dakota St. (15)     Gon (2)  Gon (2)  Gon (2)  Gon (2)  Ken (1)          
Wisconsin (1)             Wis (1)  Wis (1)  Wis (1)  Ari (2)  Ari (2)  Ken (1) 
Coastal Carolina (16)                                                          
Oregon (8)                Okl (9)                                              
Oklahoma St. (9)                                                               
Arkansas (5)              Ark (5)  Nor (4)                                     
Wofford (12)                                                                   
North Carolina (4)        Nor (4)                                              
Harvard (13)                                                                   
Xavier (6)                Xav (6)  Bay (3)  Ari (2)                            
Mississippi (11)                                                               
Baylor (3)                Bay (3)                                              
Georgia St. (14)                                                               
VCU (7)                   Ohi (10) Ari (2)                                     
Ohio St. (10)                                                                  
Arizona (2)               Ari (2)                                              
Texas Southern (15)                                                            
Villanova (1)             Vil (1)  Vil (1)  Vil (1)  Vir (2)                   
Lafayette (16)                                                                 
North Carolina St. (8)    Nor (8)                                              
LSU (9)                                                                        
Northern Iowa (5)         Nor (5)  Nor (5)                                     
Wyoming (12)                                                                   
Louisville (4)            Lou (4)                                              
UC Irvine (13)                                                                 
Providence (6)            Pro (6)  Okl (3)  Vir (2)                            
Dayton (11)                                                                    
Oklahoma (3)              Okl (3)                                              
Albany (14)                                                                    
Michigan St. (7)          Mic (7)  Vir (2)                                     
Georgia (10)                                                                   
Virginia (2)              Vir (2)                                              
Belmont (15)                                                                   
Total bracket energy: -44.697443432

Most common bracket (7)
Kentucky (1)                                                                   
Hampton (16)              Ham (16)                                             
Cincinnati (8)                                                                 
Purdue (9)                Cin (8)  Cin (8)                                     
West Virginia (5)                                                              
Buffalo (12)              Wes (5)                                              
Maryland (4)                                                                   
Valparaiso (13)           Mar (4)  Wes (5)  Cin (8)                            
Butler (6)                                                                     
Texas (11)                But (6)                                              
Notre Dame (3)                                                                 
Northeastern (14)         Nor (14) But (6)                                     
Wichita St. (7)                                                                
Indiana (10)              Wic (7)                                              
Kansas (2)                                                                     
New Mexico St. (15)       Kan (2)  Kan (2)  But (6)  But (6)                   
Duke (1)                                                                       
Robert Morris (16)        Rob (16)                                             
San Diego St. (8)                                                              
St. John's (9)            San (8)  San (8)                                     
Utah (5)                                                                       
Stephen F. Austin (12)    Uta (5)                                              
Georgetown (4)                                                                 
Eastern Washington (13)   Geo (4)  Geo (4)  Geo (4)                            
SMU (6)                                                                        
UCLA (11)                 SMU (6)                                              
Iowa St. (3)                                                                   
UAB (14)                  Iow (3)  Iow (3)                                     
Iowa (7)                                                                       
Davidson (10)             Iow (7)                                              
Gonzaga (2)                                                                    
North Dakota St. (15)     Gon (2)  Gon (2)  Gon (2)  Gon (2)  Gon (2)          
Wisconsin (1)             Wis (1)  Ore (8)  Ore (8)  Ari (2)  Ari (2)  Ari (2) 
Coastal Carolina (16)                                                          
Oregon (8)                Ore (8)                                              
Oklahoma St. (9)                                                               
Arkansas (5)              Ark (5)  Nor (4)                                     
Wofford (12)                                                                   
North Carolina (4)        Nor (4)                                              
Harvard (13)                                                                   
Xavier (6)                Xav (6)  Xav (6)  Ari (2)                            
Mississippi (11)                                                               
Baylor (3)                Bay (3)                                              
Georgia St. (14)                                                               
VCU (7)                   Ohi (10) Ari (2)                                     
Ohio St. (10)                                                                  
Arizona (2)               Ari (2)                                              
Texas Southern (15)                                                            
Villanova (1)             Laf (16) Nor (8)  Nor (5)  Mic (7)                   
Lafayette (16)                                                                 
North Carolina St. (8)    Nor (8)                                              
LSU (9)                                                                        
Northern Iowa (5)         Nor (5)  Nor (5)                                     
Wyoming (12)                                                                   
Louisville (4)            UC  (13)                                             
UC Irvine (13)                                                                 
Providence (6)            Pro (6)  Pro (6)  Mic (7)                            
Dayton (11)                                                                    
Oklahoma (3)              Okl (3)                                              
Albany (14)                                                                    
Michigan St. (7)          Mic (7)  Mic (7)                                     
Georgia (10)                                                                   
Virginia (2)              Bel (15)                                             
Belmont (15)                                                                   
Total bracket energy: -38.100050778

In [13]:
h = HTML(MMMC.maketable(results))
h
Out[13]:
Team Region Rank 2nd Round 3rd Round Sweet 16 Elite 8 Final 4 Championship Win
Kentucky midwest 1 100.0 58.69 44.77 32.58 23.98 16.41 10.87
Villanova east 1 100.0 65.53 45.71 32.0 20.87 13.77 8.67
Arizona west 2 100.0 62.57 44.59 33.54 20.75 13.11 8.03
Wisconsin west 1 100.0 67.67 43.02 27.42 18.12 11.35 7.15
Virginia east 2 100.0 63.76 43.67 30.86 19.05 11.08 6.59
Gonzaga south 2 100.0 57.98 39.95 25.46 15.02 9.09 4.78
Utah south 5 100.0 57.92 34.4 21.04 13.19 6.8 3.66
Duke south 1 100.0 60.1 35.45 18.48 11.41 6.73 3.49
Notre Dame midwest 3 100.0 61.12 33.93 19.4 11.17 5.57 2.8
Oklahoma east 3 100.0 62.0 37.74 17.64 10.01 5.26 2.64
Kansas midwest 2 100.0 66.57 35.81 19.02 9.97 5.45 2.59
Northern Iowa east 5 100.0 62.27 35.11 18.67 10.05 4.82 2.38
Iowa St. south 3 100.0 60.06 36.95 20.04 11.12 5.24 2.36
Wichita St. midwest 7 100.0 56.94 31.91 19.06 8.76 4.18 2.17
North Carolina west 4 100.0 58.35 32.3 18.51 9.54 4.23 1.82
SMU south 6 100.0 55.35 29.26 16.91 8.61 3.79 1.61
Baylor west 3 100.0 56.63 29.51 11.82 7.12 3.63 1.56
Georgetown south 4 100.0 56.75 29.58 15.75 7.23 3.24 1.42
Texas midwest 11 100.0 58.28 27.89 14.8 6.87 3.3 1.39
Michigan St. east 7 100.0 59.67 27.43 14.51 6.41 2.98 1.29
Ohio St. west 10 100.0 50.4 23.8 12.61 6.46 3.19 1.27
West Virginia midwest 5 100.0 58.66 33.41 13.82 6.62 2.97 1.17
Arkansas west 5 100.0 57.66 31.16 12.71 5.34 2.38 1.16
Louisville east 4 100.0 55.49 31.49 15.28 7.18 3.1 1.14
Iowa south 7 100.0 55.68 25.06 9.92 5.49 2.67 1.13
Butler midwest 6 100.0 41.72 24.06 11.13 6.36 2.77 1.12
Xavier west 6 100.0 56.43 31.36 14.27 6.71 2.44 1.12
San Diego St. south 8 100.0 52.15 29.17 12.83 6.64 2.88 1.03
Cincinnati midwest 8 100.0 50.58 20.36 12.28 5.35 2.38 0.9
Davidson south 10 100.0 44.32 15.66 6.98 3.12 1.69 0.73
VCU west 7 100.0 49.6 19.03 8.63 4.28 1.81 0.71
Maryland midwest 4 100.0 50.87 22.22 9.44 3.6 1.86 0.7
Oregon west 8 100.0 58.41 24.89 10.74 4.78 1.74 0.66
Stephen F. Austin south 12 100.0 42.08 23.6 11.54 4.51 1.72 0.63
St. John's south 9 100.0 47.85 18.82 9.51 3.97 1.83 0.62
Dayton east 11 100.0 48.81 21.64 8.96 4.31 1.79 0.58
Mississippi west 11 100.0 43.57 21.86 8.38 3.0 1.41 0.55
Purdue midwest 9 100.0 49.42 20.37 9.7 3.31 1.33 0.53
Buffalo midwest 12 100.0 41.34 21.68 7.93 2.77 1.19 0.52
Valparaiso midwest 13 100.0 49.13 22.69 9.34 4.12 1.49 0.51
Providence east 6 100.0 51.19 26.68 11.18 4.28 1.4 0.48
LSU east 9 100.0 48.81 15.79 7.11 2.58 0.92 0.47
Georgia east 10 100.0 40.33 15.12 7.58 2.95 1.3 0.46
Oklahoma St. west 9 100.0 41.59 18.78 9.24 3.35 1.1 0.45
Wofford west 12 100.0 42.34 16.57 7.37 2.63 0.94 0.44
Indiana midwest 10 100.0 43.06 18.98 8.46 3.72 1.2 0.43
UCLA south 11 100.0 44.65 21.4 8.22 2.72 1.26 0.42
North Carolina St. east 8 100.0 51.19 22.31 8.99 3.55 1.45 0.4
Harvard west 13 100.0 41.65 19.97 8.53 2.49 0.8 0.35
UC Irvine east 13 100.0 44.51 20.52 7.19 2.44 0.97 0.26
Coastal Carolina west 16 100.0 32.33 13.31 5.48 2.24 0.57 0.24
Georgia St. west 14 100.0 43.37 17.27 6.04 1.92 0.47 0.21
UAB south 14 100.0 39.94 12.39 4.94 1.57 0.73 0.17
Eastern Washington south 13 100.0 43.25 12.42 3.88 1.27 0.55 0.17
Lafayette east 16 100.0 34.47 16.19 7.11 2.28 0.5 0.15
Wyoming east 12 100.0 37.73 12.88 3.65 1.12 0.4 0.15
Albany east 14 100.0 38.0 13.94 4.57 1.84 0.53 0.11
Robert Morris south 16 100.0 39.9 16.56 6.97 2.39 0.5 0.11
Belmont east 15 100.0 36.24 13.78 4.7 1.08 0.31 0.11
Texas Southern west 15 100.0 37.43 12.58 4.71 1.27 0.25 0.11
Northeastern midwest 14 100.0 38.88 14.12 4.28 0.99 0.29 0.09
Hampton midwest 16 100.0 41.31 14.5 4.91 1.25 0.25 0.07
North Dakota St. south 15 100.0 42.02 19.33 7.53 1.74 0.33 0.05
New Mexico St. midwest 15 100.0 33.43 13.3 3.85 1.16 0.31 0.05

That is, indeed, a high temperature! Let's try it with a bunch more runs.
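For context on what the temperature knob is doing: in a Metropolis-style sampler, proposals that lower the bracket energy are always accepted, and ones that raise it by some amount dE are accepted with probability exp(-dE/T). This is a generic sketch of that rule, not MMMC's actual implementation:

```python
import math
import random

# Generic Metropolis acceptance step (a sketch, not MMMC's internals).
def accept(delta_energy, T, rng=random.random):
    """Always take downhill moves; take uphill moves with
    probability exp(-delta_energy / T)."""
    if delta_energy <= 0:
        return True
    return rng() < math.exp(-delta_energy / T)

# At T=1.5, a move that worsens the energy by 1 is still accepted
# roughly half the time, which is why so many upsets survive:
print(math.exp(-1.0 / 1.5))
```

Raising T flattens the acceptance probability toward 1 (essentially random brackets); lowering it toward 0 approaches a greedy, all-chalk bracket.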

In [14]:
results = MMMC.runbracket1(ntrials=100000,T=1.5)
Lowest energy bracket
Kentucky (1)                                                                   
Hampton (16)              Ken (1)                                              
Cincinnati (8)                                                                 
Purdue (9)                Cin (8)  Ken (1)                                     
West Virginia (5)                                                              
Buffalo (12)              Wes (5)                                              
Maryland (4)                                                                   
Valparaiso (13)           Mar (4)  Wes (5)  Ken (1)                            
Butler (6)                                                                     
Texas (11)                Tex (11)                                             
Notre Dame (3)                                                                 
Northeastern (14)         Not (3)  Not (3)                                     
Wichita St. (7)                                                                
Indiana (10)              Wic (7)                                              
Kansas (2)                                                                     
New Mexico St. (15)       Kan (2)  Kan (2)  Not (3)  Ken (1)                   
Duke (1)                                                                       
Robert Morris (16)        Duk (1)                                              
San Diego St. (8)                                                              
St. John's (9)            San (8)  Duk (1)                                     
Utah (5)                                                                       
Stephen F. Austin (12)    Uta (5)                                              
Georgetown (4)                                                                 
Eastern Washington (13)   Geo (4)  Uta (5)  Duk (1)                            
SMU (6)                                                                        
UCLA (11)                 SMU (6)                                              
Iowa St. (3)                                                                   
UAB (14)                  Iow (3)  Iow (3)                                     
Iowa (7)                                                                       
Davidson (10)             Iow (7)                                              
Gonzaga (2)                                                                    
North Dakota St. (15)     Gon (2)  Gon (2)  Gon (2)  Gon (2)  Ken (1)          
Wisconsin (1)             Wis (1)  Wis (1)  Wis (1)  Ari (2)  Ari (2)  Ken (1) 
Coastal Carolina (16)                                                          
Oregon (8)                Okl (9)                                              
Oklahoma St. (9)                                                               
Arkansas (5)              Ark (5)  Nor (4)                                     
Wofford (12)                                                                   
North Carolina (4)        Nor (4)                                              
Harvard (13)                                                                   
Xavier (6)                Xav (6)  Bay (3)  Ari (2)                            
Mississippi (11)                                                               
Baylor (3)                Bay (3)                                              
Georgia St. (14)                                                               
VCU (7)                   Ohi (10) Ari (2)                                     
Ohio St. (10)                                                                  
Arizona (2)               Ari (2)                                              
Texas Southern (15)                                                            
Villanova (1)             Vil (1)  Vil (1)  Vil (1)  Vir (2)                   
Lafayette (16)                                                                 
North Carolina St. (8)    Nor (8)                                              
LSU (9)                                                                        
Northern Iowa (5)         Nor (5)  Nor (5)                                     
Wyoming (12)                                                                   
Louisville (4)            Lou (4)                                              
UC Irvine (13)                                                                 
Providence (6)            Pro (6)  Okl (3)  Vir (2)                            
Dayton (11)                                                                    
Oklahoma (3)              Okl (3)                                              
Albany (14)                                                                    
Michigan St. (7)          Mic (7)  Vir (2)                                     
Georgia (10)                                                                   
Virginia (2)              Vir (2)                                              
Belmont (15)                                                                   
Total bracket energy: -44.697443432

Most common bracket (9)
Kentucky (1)                                                                   
Hampton (16)              Ken (1)                                              
Cincinnati (8)                                                                 
Purdue (9)                Cin (8)  Ken (1)                                     
West Virginia (5)                                                              
Buffalo (12)              Buf (12)                                             
Maryland (4)                                                                   
Valparaiso (13)           Val (13) Buf (12) Ken (1)                            
Butler (6)                                                                     
Texas (11)                Tex (11)                                             
Notre Dame (3)                                                                 
Northeastern (14)         Nor (14) Nor (14)                                    
Wichita St. (7)                                                                
Indiana (10)              Wic (7)                                              
Kansas (2)                                                                     
New Mexico St. (15)       Kan (2)  Wic (7)  Wic (7)  Ken (1)                   
Duke (1)                                                                       
Robert Morris (16)        Rob (16)                                             
San Diego St. (8)                                                              
St. John's (9)            St. (9)  St. (9)                                     
Utah (5)                                                                       
Stephen F. Austin (12)    Ste (12)                                             
Georgetown (4)                                                                 
Eastern Washington (13)   Eas (13) Eas (13) St. (9)                            
SMU (6)                                                                        
UCLA (11)                 SMU (6)                                              
Iowa St. (3)                                                                   
UAB (14)                  Iow (3)  Iow (3)                                     
Iowa (7)                                                                       
Davidson (10)             Dav (10)                                             
Gonzaga (2)                                                                    
North Dakota St. (15)     Nor (15) Dav (10) Iow (3)  Iow (3)  Ken (1)          
Wisconsin (1)             Wis (1)  Wis (1)  Ark (5)  Ark (5)  Vir (2)  Vir (2) 
Coastal Carolina (16)                                                          
Oregon (8)                Okl (9)                                              
Oklahoma St. (9)                                                               
Arkansas (5)              Ark (5)  Ark (5)                                     
Wofford (12)                                                                   
North Carolina (4)        Nor (4)                                              
Harvard (13)                                                                   
Xavier (6)                Xav (6)  Bay (3)  Bay (3)                            
Mississippi (11)                                                               
Baylor (3)                Bay (3)                                              
Georgia St. (14)                                                               
VCU (7)                   Ohi (10) Ohi (10)                                    
Ohio St. (10)                                                                  
Arizona (2)               Ari (2)                                              
Texas Southern (15)                                                            
Villanova (1)             Vil (1)  Vil (1)  Lou (4)  Vir (2)                   
Lafayette (16)                                                                 
North Carolina St. (8)    LSU (9)                                              
LSU (9)                                                                        
Northern Iowa (5)         Nor (5)  Lou (4)                                     
Wyoming (12)                                                                   
Louisville (4)            Lou (4)                                              
UC Irvine (13)                                                                 
Providence (6)            Day (11) Alb (14) Vir (2)                            
Dayton (11)                                                                    
Oklahoma (3)              Alb (14)                                             
Albany (14)                                                                    
Michigan St. (7)          Mic (7)  Vir (2)                                     
Georgia (10)                                                                   
Virginia (2)              Vir (2)                                              
Belmont (15)                                                                   
Total bracket energy: -37.1483683417

In [15]:
h = HTML(MMMC.maketable(results))
h
Out[15]:
Team Region Rank 2nd Round 3rd Round Sweet 16 Elite 8 Final 4 Championship Win
Kentucky midwest 1 100.0 67.848 49.189 36.444 26.79 18.949 12.828
Arizona west 2 100.0 64.114 42.628 29.248 18.838 12.256 7.805
Wisconsin west 1 100.0 63.368 42.146 28.131 17.705 11.083 6.862
Villanova east 1 100.0 64.314 43.814 28.618 17.425 10.64 6.365
Virginia east 2 100.0 62.189 39.522 26.285 17.038 10.321 6.183
Gonzaga south 2 100.0 61.672 38.305 24.17 14.417 8.149 4.538
Duke south 1 100.0 64.407 40.263 24.75 14.927 8.061 4.37
Oklahoma east 3 100.0 62.257 38.005 20.227 11.564 5.906 3.04
Utah south 5 100.0 56.188 34.295 19.469 10.846 5.576 2.832
Notre Dame midwest 3 100.0 60.692 33.504 18.74 9.665 5.061 2.473
Kansas midwest 2 100.0 59.732 33.553 18.682 9.098 4.834 2.329
Northern Iowa east 5 100.0 58.928 34.06 17.509 9.302 4.604 2.239
North Carolina west 4 100.0 61.148 36.852 19.086 9.535 4.682 2.223
Iowa St. south 3 100.0 59.581 33.405 18.131 9.562 4.647 2.206
Baylor west 3 100.0 58.599 34.549 16.969 9.177 4.63 2.202
Wichita St. midwest 7 100.0 56.45 31.005 17.065 8.454 4.378 2.104
Louisville east 4 100.0 56.525 30.188 15.331 6.889 3.158 1.413
Texas midwest 11 100.0 52.783 28.184 14.644 6.604 3.115 1.372
SMU south 6 100.0 54.37 29.068 14.489 6.789 3.097 1.361
Georgetown south 4 100.0 58.912 28.238 14.08 6.915 3.015 1.311
Michigan St. east 7 100.0 55.477 25.467 12.998 6.372 2.951 1.292
Ohio St. west 10 100.0 50.1 23.617 12.158 5.813 2.825 1.18
West Virginia midwest 5 100.0 52.507 28.241 12.365 5.829 2.809 1.09
Butler midwest 6 100.0 47.217 24.427 12.449 5.551 2.47 1.088
Iowa south 7 100.0 51.023 24.916 12.842 6.02 2.641 1.084
Arkansas west 5 100.0 58.71 28.683 13.046 5.712 2.619 1.074
San Diego St. south 8 100.0 52.821 25.262 12.281 5.721 2.585 1.026
Xavier west 6 100.0 52.833 26.252 11.254 5.424 2.318 0.966
VCU west 7 100.0 49.9 22.655 11.034 5.322 2.272 0.915
Davidson south 10 100.0 48.977 23.746 11.654 5.393 2.259 0.887
Maryland midwest 4 100.0 54.293 27.841 10.992 4.69 2.124 0.852
Cincinnati midwest 8 100.0 49.481 21.064 10.878 4.977 2.137 0.809
Providence east 6 100.0 50.52 23.839 10.379 4.86 1.981 0.778
North Carolina St. east 8 100.0 52.32 22.887 11.35 4.862 1.818 0.717
Oregon west 8 100.0 50.892 22.58 10.956 4.322 1.844 0.7
St. John's south 9 100.0 47.179 22.962 10.652 4.691 1.904 0.687
Stephen F. Austin south 12 100.0 43.812 22.666 9.837 4.551 1.859 0.674
Oklahoma St. west 9 100.0 49.108 21.697 10.965 4.484 1.736 0.666
Georgia east 10 100.0 44.523 21.647 10.237 4.523 1.723 0.65
Dayton east 11 100.0 49.48 22.922 9.937 4.078 1.609 0.594
LSU east 9 100.0 47.68 20.72 9.64 3.961 1.595 0.584
UCLA south 11 100.0 45.63 22.264 9.285 3.921 1.548 0.559
Mississippi west 11 100.0 47.167 20.522 8.975 3.684 1.45 0.53
Buffalo midwest 12 100.0 47.493 23.217 8.816 3.647 1.519 0.525
Purdue midwest 9 100.0 50.519 19.548 9.389 3.926 1.513 0.503
Indiana midwest 10 100.0 43.55 19.487 8.32 3.427 1.3 0.446
Valparaiso midwest 13 100.0 45.707 20.701 8.106 3.225 1.236 0.43
Georgia St. west 14 100.0 41.401 18.677 6.87 2.788 0.94 0.321
Harvard west 13 100.0 38.852 17.912 7.032 2.36 0.872 0.269
UC Irvine east 13 100.0 43.475 18.635 7.212 2.373 0.699 0.232
Wofford west 12 100.0 41.29 16.553 5.962 2.169 0.686 0.214
Wyoming east 12 100.0 41.072 17.117 6.21 2.154 0.697 0.194
New Mexico St. midwest 15 100.0 40.268 15.955 5.428 1.908 0.604 0.167
UAB south 14 100.0 40.419 15.263 4.757 1.649 0.545 0.153
Northeastern midwest 14 100.0 39.308 13.885 4.672 1.41 0.479 0.145
Belmont east 15 100.0 37.811 13.364 4.884 1.732 0.534 0.144
Albany east 14 100.0 37.743 15.234 5.053 1.602 0.465 0.14
Eastern Washington south 13 100.0 41.088 14.801 5.269 1.861 0.544 0.139
Coastal Carolina west 16 100.0 36.632 13.577 4.822 1.707 0.492 0.126
North Dakota St. south 15 100.0 38.328 13.033 4.672 1.602 0.478 0.121
Lafayette east 16 100.0 35.686 12.579 4.13 1.265 0.31 0.078
Robert Morris south 16 100.0 35.593 11.513 3.662 1.135 0.33 0.073
Texas Southern west 15 100.0 35.886 11.1 3.492 0.96 0.284 0.066
Hampton midwest 16 100.0 32.152 10.199 3.01 0.799 0.234 0.056

In [17]:
MMMC.showstats(results['all'],description='full run',newfig=True)

Not bad! (Also, note the shape of the energy distribution, and the fact that the "Unique brackets" group is still basically linear!)

Who wins?

But let's be honest for a minute. We all know Kansas is going to win. What do the brackets where that happened look like?

In [18]:
goodbrackets = [i for i in results['all'].brackets if i.bracket[-1][0] == 'Kansas']
In [19]:
from MarchMadnessMonteCarlo.Brackets import Stats
lb, mcb, mcb_count, unique_brackets, lowest_sightings = Stats.gather_uniquestats(goodbrackets)
sr = MMMC.SimulationResults(goodbrackets,unique_brackets,lb,lowest_sightings,mcb,mcb_count)
trueresults = {'all':sr}
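The conditioning trick above generalizes: filter the simulated brackets down to the event you care about, then recompute all the statistics on the surviving subset. A toy version of the idea (the team lists here are made up for illustration, not MMMC output):

```python
from collections import Counter

# Pretend each simulation run produced a (champion, runner-up) pair.
simulated_champs = ['Kentucky'] * 6 + ['Kansas'] * 3 + ['Duke'] * 1
runners_up = ['Wisconsin', 'Arizona', 'Virginia', 'Wisconsin', 'Arizona',
              'Wisconsin', 'Virginia', 'Virginia', 'Arizona', 'Kentucky']

# Keep only the runs where Kansas took the title, then re-tabulate
# who they beat in the final within that subset:
conditioned = [r for c, r in zip(simulated_champs, runners_up)
               if c == 'Kansas']
print(Counter(conditioned))  # Counter({'Virginia': 2, 'Arizona': 1})
```

That is exactly what the cells above do with the real brackets: the list comprehension selects the Kansas-wins runs, and Stats.gather_uniquestats plus SimulationResults rebuild the summary over just those.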
In [20]:
h = HTML(MMMC.maketable(trueresults))
h
Out[20]:
Team Region Rank 2nd Round 3rd Round Sweet 16 Elite 8 Final 4 Championship Win
Kansas midwest 2 100.0 100.0 100.0 100.0 100.0 100.0 100.0
Virginia east 2 100.0 61.743237441 39.7595534564 25.7621296694 17.1318162301 9.01674538429 0.0
Villanova east 1 100.0 68.05495921 47.8316874195 30.8286818377 17.3894375268 8.71618720481 0.0
Wisconsin west 1 100.0 62.0008587377 39.5019321597 24.9033920137 14.2121082009 8.2438814942 0.0
Oklahoma east 3 100.0 63.8042078145 42.7221983684 22.4130528124 13.0957492486 7.7715757836 0.0
Arizona west 2 100.0 62.9025332761 37.5697724345 24.4310863031 14.6414770288 7.72863890082 0.0
Northern Iowa east 5 100.0 59.5534564191 37.1404036067 18.3769858308 10.5195362817 5.45298411335 0.0
Baylor west 3 100.0 55.3456419064 30.9145556033 15.5431515672 9.05968226707 4.89480463718 0.0
North Carolina west 4 100.0 58.8664662945 35.4229282954 19.321597252 9.61786174324 4.8518677544 0.0
Michigan St. east 7 100.0 57.4924860455 26.4920566767 12.7951910691 6.26878488622 3.47788750537 0.0
Louisville east 4 100.0 58.3082868184 29.2400171748 14.1262344354 6.82696436239 3.30613997424 0.0
Xavier west 6 100.0 50.1502790897 25.1610133104 11.9793902963 7.29927007299 3.17732932589 0.0
Oregon west 8 100.0 52.2112494633 22.4559896951 12.2799484757 5.28123658222 2.9197080292 0.0
Ohio St. west 10 100.0 50.493774152 26.8355517389 13.6539287248 5.53885787892 2.79089738085 0.0
Arkansas west 5 100.0 60.3263203091 28.9823958781 11.8935165307 5.88235294118 2.70502361529 0.0
VCU west 7 100.0 49.506225848 23.5723486475 12.1082009446 5.83941605839 2.53327608416 0.0
Providence east 6 100.0 51.1807642765 21.4255045084 9.44611421211 4.46543580936 2.44740231859 0.0
North Carolina St. east 8 100.0 55.3885787892 24.6887075998 13.009875483 4.8518677544 2.14684413912 0.0
Mississippi west 11 100.0 49.8497209103 24.6887075998 10.9489051095 4.76599398884 1.9321597252 0.0
Dayton east 11 100.0 48.8192357235 20.0085873766 9.14555603263 4.03606698154 1.9321597252 0.0
Oklahoma St. west 9 100.0 47.7887505367 22.0695577501 11.5500214684 4.20781451267 1.71747531129 0.0
Georgia St. west 14 100.0 44.6543580936 19.2357234865 7.64276513525 3.82138256763 1.71747531129 0.0
Georgia east 10 100.0 42.5075139545 20.6097037355 9.40317732933 3.60669815371 1.63160154573 0.0
Harvard west 13 100.0 41.1335337055 20.0515242593 8.41562902533 3.34907685702 1.58866466295 0.0
Wofford west 12 100.0 39.6736796909 15.5431515672 5.88235294118 2.9197080292 1.37398024903 0.0
LSU east 9 100.0 44.6114212108 18.3769858308 8.75912408759 3.69257191928 1.33104336625 0.0
Belmont east 15 100.0 38.256762559 13.1386861314 6.65521683126 2.10390725633 1.07342206956 0.0
Coastal Carolina west 16 100.0 37.9991412623 15.972520395 5.75354229283 2.36152855303 0.901674538429 0.0
Wyoming east 12 100.0 40.4465435809 17.1747531129 5.5817947617 1.71747531129 0.686990124517 0.0
UC Irvine east 13 100.0 41.6917131816 16.4448261056 5.66766852726 1.63160154573 0.515242593388 0.0
Texas Southern west 15 100.0 37.0974667239 12.022327179 3.69257191928 1.2022327179 0.515242593388 0.0
Lafayette east 16 100.0 31.94504079 9.10261914985 3.6496350365 1.15929583512 0.472305710605 0.0
Albany east 14 100.0 36.1957921855 15.8437097467 4.3795620438 1.50279089738 0.429368827823 0.0
Duke south 1 100.0 61.8291112065 40.1459854015 22.1124946329 13.5680549592 0.0 0.0
Gonzaga south 2 100.0 55.5173894375 35.0364963504 21.4684413912 10.0472305711 0.0 0.0
Utah south 5 100.0 54.8733361958 30.184628596 15.8866466295 9.7037355088 0.0 0.0
Iowa St. south 3 100.0 59.167024474 32.546157149 17.9905538858 9.36024044654 0.0 0.0
Georgetown south 4 100.0 58.780592529 30.184628596 16.1013310434 8.54443967368 0.0 0.0
St. John's south 9 100.0 49.0339201374 23.7011592958 13.7827393731 6.86990124517 0.0 0.0
Stephen F. Austin south 12 100.0 45.1266638042 24.9033920137 10.261914985 6.39759553456 0.0 0.0
SMU south 6 100.0 54.6157148991 27.6513525118 13.8686131387 6.35465865178 0.0 0.0
Davidson south 10 100.0 48.6045513096 24.3022756548 12.022327179 5.5817947617 0.0 0.0
San Diego St. south 8 100.0 50.9660798626 22.6706741091 12.2799484757 5.49592099614 0.0 0.0
Iowa south 7 100.0 51.3954486904 24.4740231859 12.5375697724 5.45298411335 0.0 0.0
UCLA south 11 100.0 45.3842851009 23.1000429369 11.1635895234 5.06655216831 0.0 0.0
Eastern Washington south 13 100.0 41.219407471 14.7273507943 5.88235294118 2.49033920137 0.0 0.0
North Dakota St. south 15 100.0 44.4826105625 16.1872048089 6.18291112065 2.06097037355 0.0 0.0
UAB south 14 100.0 40.832975526 16.7024474023 4.76599398884 1.84628595964 0.0 0.0
Robert Morris south 16 100.0 38.1708887935 13.4821811936 3.69257191928 1.15929583512 0.0 0.0
Kentucky midwest 1 100.0 61.743237441 39.6307428081 21.8119364534 0.0 0.0 0.0
Maryland midwest 4 100.0 59.4246457707 31.2151137827 16.3160154573 0.0 0.0 0.0
Purdue midwest 9 100.0 55.7320738514 27.4796049807 13.3963074281 0.0 0.0 0.0
Buffalo midwest 12 100.0 47.7458136539 24.4310863031 12.2799484757 0.0 0.0 0.0
West Virginia midwest 5 100.0 52.2541863461 23.7440961786 11.8076427651 0.0 0.0 0.0
Valparaiso midwest 13 100.0 40.5753542293 20.6097037355 9.9613568055 0.0 0.0 0.0
Cincinnati midwest 8 100.0 44.2679261486 20.5238299699 9.9613568055 0.0 0.0 0.0
Hampton midwest 16 100.0 38.256762559 12.3658222413 4.46543580936 0.0 0.0 0.0
Texas midwest 11 100.0 51.5671962216 30.9145556033 0.0 0.0 0.0 0.0
Notre Dame midwest 3 100.0 58.4800343495 30.2275654787 0.0 0.0 0.0 0.0
Butler midwest 6 100.0 48.4328037784 23.0141691713 0.0 0.0 0.0 0.0
Northeastern midwest 14 100.0 41.5199656505 15.8437097467 0.0 0.0 0.0 0.0
Wichita St. midwest 7 100.0 55.2168312581 0.0 0.0 0.0 0.0 0.0
Indiana midwest 10 100.0 44.7831687419 0.0 0.0 0.0 0.0 0.0
New Mexico St. midwest 15 100.0 0.0 0.0 0.0 0.0 0.0 0.0
In [21]:
sr.most_common_bracket
Out[21]:
Kentucky (1)                                                                   
Hampton (16)              Ham (16)                                             
Cincinnati (8)                                                                 
Purdue (9)                Pur (9)  Pur (9)                                     
West Virginia (5)                                                              
Buffalo (12)              Buf (12)                                             
Maryland (4)                                                                   
Valparaiso (13)           Val (13) Buf (12) Pur (9)                            
Butler (6)                                                                     
Texas (11)                Tex (11)                                             
Notre Dame (3)                                                                 
Northeastern (14)         Not (3)  Not (3)                                     
Wichita St. (7)                                                                
Indiana (10)              Ind (10)                                             
Kansas (2)                                                                     
New Mexico St. (15)       Kan (2)  Kan (2)  Kan (2)  Kan (2)                   
Duke (1)                                                                       
Robert Morris (16)        Duk (1)                                              
San Diego St. (8)                                                              
St. John's (9)            St. (9)  Duk (1)                                     
Utah (5)                                                                       
Stephen F. Austin (12)    Ste (12)                                             
Georgetown (4)                                                                 
Eastern Washington (13)   Geo (4)  Ste (12) Ste (12)                           
SMU (6)                                                                        
UCLA (11)                 SMU (6)                                              
Iowa St. (3)                                                                   
UAB (14)                  Iow (3)  SMU (6)                                     
Iowa (7)                                                                       
Davidson (10)             Iow (7)                                              
Gonzaga (2)                                                                    
North Dakota St. (15)     Gon (2)  Gon (2)  Gon (2)  Ste (12) Kan (2)          
Wisconsin (1)             Wis (1)  Okl (9)  Wof (12) Wof (12) Okl (3)  Kan (2) 
Coastal Carolina (16)                                                          
Oregon (8)                Okl (9)                                              
Oklahoma St. (9)                                                               
Arkansas (5)              Wof (12) Wof (12)                                    
Wofford (12)                                                                   
North Carolina (4)        Har (13)                                             
Harvard (13)                                                                   
Xavier (6)                Xav (6)  Bay (3)  Ari (2)                            
Mississippi (11)                                                               
Baylor (3)                Bay (3)                                              
Georgia St. (14)                                                               
VCU (7)                   Ohi (10) Ari (2)                                     
Ohio St. (10)                                                                  
Arizona (2)               Ari (2)                                              
Texas Southern (15)                                                            
Villanova (1)             Laf (16) Nor (8)  Nor (8)  Okl (3)                   
Lafayette (16)                                                                 
North Carolina St. (8)    Nor (8)                                              
LSU (9)                                                                        
Northern Iowa (5)         Wyo (12) Lou (4)                                     
Wyoming (12)                                                                   
Louisville (4)            Lou (4)                                              
UC Irvine (13)                                                                 
Providence (6)            Pro (6)  Okl (3)  Okl (3)                            
Dayton (11)                                                                    
Oklahoma (3)              Okl (3)                                              
Albany (14)                                                                    
Michigan St. (7)          Mic (7)  Mic (7)                                     
Georgia (10)                                                                   
Virginia (2)              Vir (2)                                              
Belmont (15)                                                                   
Total bracket energy: -35.9796902956
In [22]:
sr.most_common_bracket_count
Out[22]:
4

Huh. Well, that's not exactly my pick for best bracket, but there you have it.

Exercises for the reader:

  • Make a better energy function, perhaps using a weighted average of different rankings. I slurped in KenPom and Jeff Sagarin, but you could add your own.
  • Come up with a better "hometown wins" version. E.g. explicitly check for KU, and tweak the rankings.
In [46]:
from MarchMadnessMonteCarlo import RankingsAndStrength as RAS
strength = RAS.kenpom['Pyth']
jsstrength = RAS.sagarin['Rating']

def weighted_KU_energy_game(winner, loser):
    if winner == 'Kansas':
        win_pct = 0.99
    elif loser == 'Kansas':
        win_pct = 0.01
    else:
        A,B = strength[winner],strength[loser]
        # see http://207.56.97.150/articles/playoff2002.htm
        kenpom = (A-A*B)/(A+B-2*A*B)
        A,B = jsstrength[winner]/100, jsstrength[loser]/100
        A,B = min(A,0.9999),min(B,0.9999)
        sagarin = (A-A*B)/(A+B-2*A*B)
        win_pct = 0.70*kenpom + 0.30*sagarin
    return -win_pct
In [48]:
MMMC.set_energy_function(weighted_KU_energy_game)
In [55]:
results = MMMC.runbracket1(ntrials=100000,T=1.5)
Lowest energy bracket
Kentucky (1)                                                                   
Hampton (16)              Ken (1)                                              
Cincinnati (8)                                                                 
Purdue (9)                Cin (8)  Ken (1)                                     
West Virginia (5)                                                              
Buffalo (12)              Wes (5)                                              
Maryland (4)                                                                   
Valparaiso (13)           Mar (4)  Wes (5)  Ken (1)                            
Butler (6)                                                                     
Texas (11)                Tex (11)                                             
Notre Dame (3)                                                                 
Northeastern (14)         Not (3)  Not (3)                                     
Wichita St. (7)                                                                
Indiana (10)              Wic (7)                                              
Kansas (2)                                                                     
New Mexico St. (15)       Kan (2)  Kan (2)  Kan (2)  Kan (2)                   
Duke (1)                                                                       
Robert Morris (16)        Duk (1)                                              
San Diego St. (8)                                                              
St. John's (9)            San (8)  Duk (1)                                     
Utah (5)                                                                       
Stephen F. Austin (12)    Uta (5)                                              
Georgetown (4)                                                                 
Eastern Washington (13)   Geo (4)  Uta (5)  Duk (1)                            
SMU (6)                                                                        
UCLA (11)                 SMU (6)                                              
Iowa St. (3)                                                                   
UAB (14)                  Iow (3)  Iow (3)                                     
Iowa (7)                                                                       
Davidson (10)             Iow (7)                                              
Gonzaga (2)                                                                    
North Dakota St. (15)     Gon (2)  Gon (2)  Gon (2)  Duk (1)  Kan (2)          
Wisconsin (1)             Wis (1)  Wis (1)  Wis (1)  Ari (2)  Ari (2)  Kan (2) 
Coastal Carolina (16)                                                          
Oregon (8)                Okl (9)                                              
Oklahoma St. (9)                                                               
Arkansas (5)              Ark (5)  Nor (4)                                     
Wofford (12)                                                                   
North Carolina (4)        Nor (4)                                              
Harvard (13)                                                                   
Xavier (6)                Xav (6)  Bay (3)  Ari (2)                            
Mississippi (11)                                                               
Baylor (3)                Bay (3)                                              
Georgia St. (14)                                                               
VCU (7)                   Ohi (10) Ari (2)                                     
Ohio St. (10)                                                                  
Arizona (2)               Ari (2)                                              
Texas Southern (15)                                                            
Villanova (1)             Vil (1)  Vil (1)  Vil (1)  Vir (2)                   
Lafayette (16)                                                                 
North Carolina St. (8)    Nor (8)                                              
LSU (9)                                                                        
Northern Iowa (5)         Nor (5)  Nor (5)                                     
Wyoming (12)                                                                   
Louisville (4)            Lou (4)                                              
UC Irvine (13)                                                                 
Providence (6)            Pro (6)  Okl (3)  Vir (2)                            
Dayton (11)                                                                    
Oklahoma (3)              Okl (3)                                              
Albany (14)                                                                    
Michigan St. (7)          Mic (7)  Vir (2)                                     
Georgia (10)                                                                   
Virginia (2)              Vir (2)                                              
Belmont (15)                                                                   
Total bracket energy: -45.7476455682

Most common bracket (13)
Kentucky (1)                                                                   
Hampton (16)              Ken (1)                                              
Cincinnati (8)                                                                 
Purdue (9)                Pur (9)  Pur (9)                                     
West Virginia (5)                                                              
Buffalo (12)              Buf (12)                                             
Maryland (4)                                                                   
Valparaiso (13)           Mar (4)  Buf (12) Buf (12)                           
Butler (6)                                                                     
Texas (11)                But (6)                                              
Notre Dame (3)                                                                 
Northeastern (14)         Not (3)  Not (3)                                     
Wichita St. (7)                                                                
Indiana (10)              Wic (7)                                              
Kansas (2)                                                                     
New Mexico St. (15)       Kan (2)  Kan (2)  Kan (2)  Kan (2)                   
Duke (1)                                                                       
Robert Morris (16)        Duk (1)                                              
San Diego St. (8)                                                              
St. John's (9)            San (8)  San (8)                                     
Utah (5)                                                                       
Stephen F. Austin (12)    Ste (12)                                             
Georgetown (4)                                                                 
Eastern Washington (13)   Geo (4)  Ste (12) San (8)                            
SMU (6)                                                                        
UCLA (11)                 SMU (6)                                              
Iowa St. (3)                                                                   
UAB (14)                  UAB (14) UAB (14)                                    
Iowa (7)                                                                       
Davidson (10)             Iow (7)                                              
Gonzaga (2)                                                                    
North Dakota St. (15)     Gon (2)  Gon (2)  Gon (2)  Gon (2)  Kan (2)          
Wisconsin (1)             Coa (16) Okl (9)  Nor (4)  Ari (2)  Ari (2)  Kan (2) 
Coastal Carolina (16)                                                          
Oregon (8)                Okl (9)                                              
Oklahoma St. (9)                                                               
Arkansas (5)              Wof (12) Nor (4)                                     
Wofford (12)                                                                   
North Carolina (4)        Nor (4)                                              
Harvard (13)                                                                   
Xavier (6)                Mis (11) Mis (11) Ari (2)                            
Mississippi (11)                                                               
Baylor (3)                Bay (3)                                              
Georgia St. (14)                                                               
VCU (7)                   VCU (7)  Ari (2)                                     
Ohio St. (10)                                                                  
Arizona (2)               Ari (2)                                              
Texas Southern (15)                                                            
Villanova (1)             Vil (1)  Nor (8)  Nor (8)  Nor (8)                   
Lafayette (16)                                                                 
North Carolina St. (8)    Nor (8)                                              
LSU (9)                                                                        
Northern Iowa (5)         Nor (5)  Lou (4)                                     
Wyoming (12)                                                                   
Louisville (4)            Lou (4)                                              
UC Irvine (13)                                                                 
Providence (6)            Pro (6)  Okl (3)  Mic (7)                            
Dayton (11)                                                                    
Oklahoma (3)              Okl (3)                                              
Albany (14)                                                                    
Michigan St. (7)          Mic (7)  Mic (7)                                     
Georgia (10)                                                                   
Virginia (2)              Bel (15)                                             
Belmont (15)                                                                   
Total bracket energy: -38.9077921757

In [56]:
MMMC.showstats(results['all'],description='full run',newfig=True)
In [57]:
h = HTML(MMMC.maketable(results))
h
Out[57]:
Team Region Rank 2nd Round 3rd Round Sweet 16 Elite 8 Final 4 Championship Win
Kansas midwest 2 100.0 71.415 55.343 44.429 35.557 28.44 22.414
Kentucky midwest 1 100.0 69.107 50.644 37.446 18.891 13.788 9.643
Arizona west 2 100.0 65.022 43.251 29.626 19.515 12.801 6.535
Wisconsin west 1 100.0 64.476 43.066 29.09 17.775 11.201 5.467
Virginia east 2 100.0 65.076 42.901 28.131 17.706 10.58 5.127
Villanova east 1 100.0 63.795 42.213 27.546 17.299 10.203 4.848
Duke south 1 100.0 63.066 40.645 25.96 15.924 6.731 3.64
Gonzaga south 2 100.0 62.557 37.869 24.019 14.332 6.157 3.354
Utah south 5 100.0 54.831 33.958 18.603 10.684 4.171 2.105
North Carolina west 4 100.0 55.798 35.168 18.059 9.103 4.648 1.857
Oklahoma east 3 100.0 57.558 32.89 16.602 8.699 4.476 1.856
Iowa St. south 3 100.0 58.435 32.387 17.002 9.079 3.519 1.74
Baylor west 3 100.0 57.425 33.143 16.356 8.703 4.328 1.714
Northern Iowa east 5 100.0 58.362 33.246 16.966 8.687 4.148 1.653
Notre Dame midwest 3 100.0 58.973 33.146 12.23 5.888 3.064 1.459
Louisville east 4 100.0 56.572 30.948 15.201 7.349 3.329 1.289
Wichita St. midwest 7 100.0 55.022 19.553 10.731 5.057 2.614 1.235
Michigan St. east 7 100.0 53.365 25.012 13.642 6.637 3.041 1.157
Ohio St. west 10 100.0 50.97 23.373 12.8 6.241 2.876 1.053
Georgetown south 4 100.0 56.351 27.242 13.727 6.735 2.386 1.046
SMU south 6 100.0 50.545 26.664 12.96 6.318 2.197 0.98
Iowa south 7 100.0 53.959 26.803 13.763 6.446 2.273 0.979
Xavier west 6 100.0 54.002 25.958 11.422 5.532 2.541 0.936
West Virginia midwest 5 100.0 55.439 30.487 13.026 4.733 2.187 0.896
Providence east 6 100.0 54.017 27.975 12.282 5.736 2.417 0.88
Arkansas west 5 100.0 55.043 25.753 11.997 5.229 2.275 0.845
Texas midwest 11 100.0 51.713 25.828 9.107 4.172 1.938 0.84
San Diego St. south 8 100.0 49.116 22.93 11.033 5.181 1.853 0.755
VCU west 7 100.0 49.03 21.187 9.927 4.523 1.995 0.744
Maryland midwest 4 100.0 53.53 26.132 10.837 4.056 1.82 0.735
Butler midwest 6 100.0 48.287 23.838 8.227 3.593 1.688 0.694
Davidson south 10 100.0 46.041 21.646 10.642 4.871 1.683 0.678
Oklahoma St. west 9 100.0 50.857 22.453 11.343 4.682 1.87 0.671
North Carolina St. east 8 100.0 51.914 22.439 11.07 4.813 1.896 0.611
St. John's south 9 100.0 50.884 23.054 10.411 4.442 1.496 0.575
Stephen F. Austin south 12 100.0 45.169 22.367 9.957 4.368 1.439 0.568
Georgia east 10 100.0 46.635 19.336 9.623 4.499 1.597 0.567
Oregon west 8 100.0 49.143 20.518 9.907 4.144 1.678 0.559
UCLA south 11 100.0 49.455 24.063 10.453 4.388 1.575 0.554
LSU east 9 100.0 48.086 20.5 9.842 4.191 1.708 0.551
Mississippi west 11 100.0 45.998 21.001 8.299 3.623 1.44 0.541
Cincinnati midwest 8 100.0 52.308 20.082 10.116 3.18 1.382 0.541
Dayton east 11 100.0 45.983 23.105 9.306 3.898 1.639 0.495
Purdue midwest 9 100.0 47.692 18.407 8.311 2.991 1.216 0.453
Buffalo midwest 12 100.0 44.561 22.616 8.681 2.764 1.099 0.392
Indiana midwest 10 100.0 44.978 13.846 5.745 2.269 0.928 0.344
Harvard west 13 100.0 44.202 21.121 7.51 2.602 0.998 0.339
Georgia St. west 14 100.0 42.575 19.898 7.477 2.894 1.147 0.321
Valparaiso midwest 13 100.0 46.47 20.765 7.922 2.461 0.91 0.286
Wyoming east 12 100.0 41.638 18.284 7.123 2.652 0.927 0.268
UC Irvine east 13 100.0 43.428 17.522 6.975 2.726 0.965 0.263
Wofford west 12 100.0 44.957 17.958 6.914 2.297 0.81 0.241
UAB south 14 100.0 41.565 16.886 6.077 2.033 0.673 0.222
Northeastern midwest 14 100.0 41.027 17.188 5.238 1.867 0.626 0.182
Eastern Washington south 13 100.0 43.649 16.433 5.605 1.968 0.527 0.168
New Mexico St. midwest 15 100.0 28.585 11.258 4.293 1.566 0.487 0.165
Albany east 14 100.0 42.442 16.03 5.183 1.709 0.542 0.151
North Dakota St. south 15 100.0 37.443 13.682 5.084 1.791 0.499 0.145
Lafayette east 16 100.0 36.205 14.848 5.277 1.578 0.462 0.142
Belmont east 15 100.0 34.924 12.751 5.231 1.821 0.531 0.139
Coastal Carolina west 16 100.0 35.524 13.963 5.18 1.746 0.538 0.138
Robert Morris south 16 100.0 36.934 13.371 4.704 1.44 0.372 0.1
Texas Southern west 15 100.0 34.978 12.189 4.093 1.391 0.393 0.085
Hampton midwest 16 100.0 30.893 10.867 3.661 0.955 0.262 0.069

Looks good! I'm in!

Evolution of a Learning Goal

I just got back from the Lilly Conference on College Teaching. The first workshop I went to was on course design. One chunk of this was on learning goals. This came in the second half of the workshop, after we’d talked quite a bit about learning factors, etc. I chose to work on my 200-level Biophysics class, where I thought I had decent goals already. One of the ones I particularly liked was

Use simple physical models to provide quantitative insight into biological systems.

The people running the workshop have some good reasons for insisting on particular phrasing, and they recommend either “students will be able to [verb] [object]” or “given [thing], students will be able to [other thing].” Fair enough; if I lighten up a bit on the need for an action verb, I get

Students will be able to use simple physical models to provide quantitative insight into biological systems.

We talked a bit about how many goals a class should have. Their book recommends 3-5. If you have a ton more than that, you’re listing specific objectives rather than overarching goals. One of the authors/instructors actually recommends a single goal. If you can do it, it gives you a really nice way to focus the class, makes you comfortable cutting material in the interest of deep(er) learning, etc. The danger, of course, is that you don’t want a goal so general as to be meaningless.

We also talked about the time-frame of a goal. If you’re setting a 5-year goal (i.e. what do you want them to be like in 5 years), they have 5 years between now and the goal, so you don’t get to take credit (or responsibility) for the whole goal; you just need to plant a seed.

In thinking more about the class and about attitudes, it became clear that I really wanted the students to be able to identify a problem and play with it.

Students will be able to identify a biological problem and play with it, using simple physical models to provide quantitative insight.

At this point, we talked a bit more about the purpose of goals, and the idea that you want the class to be committed to the goals, and I was introduced to what may be my new favorite fable/punchline: “in a ham and egg breakfast, the chicken is involved; the pig is committed.”

Anyway, the above goal is sounding wordy, and is really trying to do too much at once. Worse, “play with it” is maybe clear to the instructor, but really unlikely to be clear to the students. There are at least two things going on here, and they’re worth separating: I’ll have to teach them as separate skills, students will certainly learn them as separate skills, and I should be explicit.

  • Given an identified biological problem, students will be able to identify several categories of physical models with which the problem might be addressed.
  • Given a biological problem and a physical model, students will be able to provide quantitative insight into the system.

At this point, it’s close, but there’s a bit more work to be done with categorizing the physical model(s). I mean, the students aren’t going to be dropping in an extraordinarily complex set of models for a 200-level class that requires only a semester of calc-based mechanics as a pre-req:

  • Given a biological system and an appropriate simple physical model, students will be able to adapt the model to the system, providing quantitative insight.
  • Given a biological problem, students will be able to identify several [candidate? potential?] physical models to apply to the system.

The idea of “adapting the model to the system” was an important addition, and a clarification of “play with.” It’s also something that I expect most students to have relatively little experience with, especially in an interdisciplinary context! In addition to a few other things, this was where I started monkeying with the “quantitative insight” wording, and where to put it in the goal.

  • Given a biological system, students will be able to name the appropriate physical models that apply.
  • Given a biological system, take a physical model and appropriately adapt it through quantitative analysis to that specific system.

tweak a bit

  • Given a biological system and a physical model, students will be able to appropriately adapt the model to the system to provide quantitative [insight?] analysis.
  • Given a biological system, students will be able to name appropriate physical models.

That’s pretty close. My penultimate version was a combination of the last few.

  • Given a biological system and a physical model, students will be able to adapt the model to the system through quantitative analysis.
  • Given a biological system, students will be able to name relevant physical models.

“… be able to name …” doesn’t sound so strong, even though “relevant” carries a lot of weight. After looking at Bloom’s Taxonomy of Educational Objectives, a simple switch from “name” to “predict” captures what I want, and makes the goal much higher-level. I’ve added the third (and final) goal here:

  • Given a biological system and a physical model, students will be able to adapt the model to the system through quantitative analysis.
  • Given a biological system, students will be able to predict relevant physical models.
  • Students will gain exposure to important questions in the modern field of molecular biophysics, and evaluate current research on a system of their choice.

That’s much better, both for me and the students! And that’s despite the fact that my original goal (“use simple physical models to provide quantitative insight into biological systems”) still sounds decent. I wonder if I should use that phrasing, and perhaps something about “play with a model” in the course description rather than the goals.

Lookup Tables

[twitter-follow username="mglerner" scheme="dark"]

This whole post is available as an IPython Notebook here.

Lookup Tables with Lagrangian Interpolation

One of my students wanted to speed up the calculation of exp(x) in a simulation. There are a few ways to do this, but a lookup table is often a huge win in situations like this. The basic idea is that, for an expensive function like exp(x), you pre-calculate exp(x) for a bunch of values that cover the range in which you're interested. You then look things up in the table at runtime. If the exact value you want isn't in the table, you use a cheap interpolation function. By tweaking the density of your pre-calculated values and the sophistication of your interpolation function, you can get results that are quite close to exact calculations for a fraction of the run-time cost.
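Before getting to the details, here's a minimal sketch of the basic idea in Python, using plain linear interpolation rather than the Lagrangian scheme described below (the range, spacing, and function names are my illustrative choices, not the values from the actual simulation):

```python
import numpy as np

# Sketch of a lookup table for exp(-x), with linear interpolation between
# grid points. Range, spacing, and names are illustrative choices only.
XMIN, XMAX, SPACING = 0.0, 3.0, 0.01
N = int(round((XMAX - XMIN) / SPACING)) + 1
grid = np.linspace(XMIN, XMAX, N)
table = np.exp(-grid)  # the expensive function, pre-computed once

def exp_table(x):
    """Approximate exp(-x) for XMIN <= x < XMAX by table lookup."""
    i = int((x - XMIN) / SPACING)       # grid point at or below x
    frac = (x - grid[i]) / SPACING      # fractional distance to the next point
    return table[i] + frac * (table[i + 1] - table[i])

# For linear interpolation the error scales roughly like SPACING**2,
# so halving the spacing cuts the error by about a factor of four.
err = abs(exp_table(1.2345) - np.exp(-1.2345))
```

Denser grids and fancier interpolation (like the Lagrangian scheme below) trade memory and setup cost for accuracy, which is exactly the knob-tweaking mentioned above.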

Sadly for me, I didn't know much about which interpolation functions to use, so I asked Andy Simmonett. I wrote the Python bits below, but the general explanation is direct from him, with some light modifications. He's a QM/MM guy, so some of what is written below should be taken in the context of molecular simulations.

Before anything else, let's set up Python and use Seaborn for good-looking default plotting parameters.

In [1]:
from __future__ import division
import sys, os
import numpy as np, scipy as sp, pandas as pd, matplotlib as mpl
from scipy import stats
import matplotlib.pyplot as plt
import seaborn
seaborn.set()

%install_ext https://raw.githubusercontent.com/rasbt/watermark/master/watermark.py
%load_ext watermark
%watermark -v -m -p numpy,pandas,scipy,matplotlib
%matplotlib inline
Installed watermark.py. To use it, type:
  %load_ext watermark
CPython 2.7.8
IPython 2.3.0

numpy 1.9.1
pandas 0.14.1
scipy 0.14.0
matplotlib 1.4.2

compiler   : GCC 4.4.7 20120313 (Red Hat 4.4.7-1)
system     : Linux
release    : 3.16.2-200.fc20.x86_64
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit

The General strategy

I [Andy] don’t know of any decent source about lookup tables, but here are some notes that demonstrate how we figured out the splines. The simplest approach to understand is Lagrangian interpolation (see the Wikipedia and Wolfram MathWorld pages), which is the approach that we used because it’s so general. The strategy is as follows:

  1. Choose a range of input values (∆E/kT in your case) that you expect to encounter frequently (say, 0 to 3). I’ve defined the function below such that an input of the positive argument ∆E/kT returns Exp[-∆E/kT].

  2. Choose a spacing value (referred to below as del) between successive grid points; this will determine both the storage needed for the table and also the accuracy of the interpolation, so some experimentation is necessary. The example below uses del=0.1, but generally you need at least a hundred grid points per unit to get single precision accuracy, and even more for double precision.

  3. You need to allocate (range/del+1) grid points to hold the table of values. For a quartic interpolation (see below) you need two extra points, to handle the end points (-0.1 and 3.1 in the example above, so you can interpolate the full range).

  4. Now you need to construct your table: Tab = {Exp[0.1], Exp[0], Exp[-0.1], Exp[-0.2], …, Exp[-3.1]}

  5. For a given input, x, if it’s outside the range of your interpolation table, just explicitly compute and return Exp[-x].

  6. If it’s inside the range, use an interpolating polynomial (below) to interpolate the values.

In [2]:
from numpy import exp,sqrt,sin
def gettable(start,stop,d,f):
    return f(np.arange(start-2*d,stop+2*d,d)) # enough for quintic

Wikipedia has some good info on Lagrangian interpolation (see links above). I’ve [Andy] pasted the explicit code needed for cubic, quartic, and quintic splines; these were obtained using the corresponding Mathematica inputs. Most codes use cubic splines for efficiency, but we found that quartic splines let you use a coarser table, so they may be more cache efficient.

  • cubic

     Simplify[InterpolatingPolynomial[{{x0 - del, e0}, {x0, e1}, {x0 + del, e2}}, x]]
    
     (2 del^2 e1 - del (e0 - e2) (x - x0) + (e0 - 2 e1 + e2) (x - x0)^2)/(2 del^2)
    
  • quartic

     Simplify[InterpolatingPolynomial[{{x0 - del, e0}, {x0, e1}, {x0 + del, e2}, {x0 + 2 del, e3}}, x]]
    
     -((-6 del^3 e1 + del^2 (2 e0 + 3 e1 - 6 e2 + e3) (x - x0) - 3 del (e0 - 2 e1 + e2) (x - x0)^2 + (e0 - 3 e1 + 3 e2 - e3) (x - x0)^3)/(6 del^3))
    
  • quintic

     Simplify[InterpolatingPolynomial[{{x0 - 2 del, e0}, {x0 - del, e1}, {x0, e2}, {x0 + del, e3}, {x0 + 2 del, e4}}, x]]
    
     (1/(24 del^4))(24 del^4 e2 + 2 del^3 (e0 - 8 e1 + 8 e3 - e4) (x - x0) - 
      del^2 (e0 - 16 e1 + 30 e2 - 16 e3 + e4) (x - x0)^2 - 
      2 del (e0 - 2 e1 + 2 e3 - e4) (x - x0)^3 + (e0 - 4 e1 + 6 e2 - 4 e3 + 
      e4) (x - x0)^4)
    

Assume we’re using a quartic polynomial, and want to compute Exp[-0.22]. We need to pick out the 4 consecutive table entries {e0,e1,e2,e3} whose x values bound 0.22, so that we’re interpolating, not extrapolating. This is shown below (# represents the value we’re interested in).

  e0    e1     e2     e3
  |      | #    |      |
  ----------------------
x0-del   x0  x0+del x0+2del
 0.1    0.2    0.3    0.4

The values {e0,e1,e2,e3} are {Exp[-0.1],Exp[-0.2],Exp[-0.3],Exp[-0.4]}, taken straight from the table. Then we just plug (x-x0 = 0.22-0.2 = 0.02) into the quartic formula above, along with the tabulated e* values, and out comes the answer.
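As a sanity check, that arithmetic can be transcribed directly into Python; this is just the quartic formula from above with the example's numbers plugged in:

```python
from math import exp

d = 0.1                      # the "del" grid spacing
x, x0 = 0.22, 0.2
# the four bracketing table entries from the worked example
e0, e1, e2, e3 = exp(-0.1), exp(-0.2), exp(-0.3), exp(-0.4)

# the quartic interpolating polynomial, verbatim
approx = -((-6*e1*d**3 + d**2*(2*e0 + 3*e1 - 6*e2 + e3)*(x - x0)
            - 3*d*(e0 - 2*e1 + e2)*(x - x0)**2
            + (e0 - 3*e1 + 3*e2 - e3)*(x - x0)**3)/(6*d**3))

print(approx, exp(-0.22))    # the two values agree closely
```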

For an odd-order interpolating spline, make sure the value you're probing is in the middle of the range of table values, although it doesn’t matter which side of the center point it falls on; make sure the corresponding sign is correct when computing x-x0, though.

This is using Lagrangian interpolation. PME uses an Euler interpolating spline instead, so that might be worth investigating. I’ve attached the paper in case you’re interested. Note that the method above works for absolutely any function; you just have to tabulate the computed values ahead of time, which is cheap. To test the table, you can simply probe all of the midpoints (0.05, 0.15, 0.25, … in the example above) and compare the interpolated and analytic function. If it’s something that’s commonly run in single precision anyway, you should be able to get away with errors around 10^-7 or 10^-8. We get 10^-13 in our code, which is close enough to the machine epsilon for our liking.
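That midpoint test is easy to sketch. Here's a minimal version using plain linear interpolation between neighboring table entries; it only illustrates the procedure (the higher-order splines do far better):

```python
import numpy as np

start, stop, d = 0.0, 3.0, 0.1

def f(x):
    return np.exp(-x)

grid = np.arange(start, stop + d, d)   # tabulated x values
table = f(grid)

# probe every midpoint and compare against the analytic function
mids = grid[:-1] + d/2
interp = (table[:-1] + table[1:])/2    # linear interpolation at each midpoint
max_err = np.abs(interp - f(mids)).max()
print(max_err)                         # shrinks as you tighten the spacing
```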

Note: I also stuck in linear interpolation for comparison.

In [3]:
def cubic(x,f,table,start,stop,d):
    try:
        #i0 will be the table index of the largest element lower than x.
        i0 = int((x - start)//d) + 2 # because we have two extra entries
        x0 = start + d*(i0 -2)
        e0, e1, e2 = table[i0-1:i0+2]
        #print("{i0} {x0} {e0} {e1} {e2} {e3}".format(**locals()))
        return (2*e1*d**2 - d*(e0 - e2)*(x - x0) + (e0 - 2*e1 + e2)*(x - x0)**2)/(2*d**2)
    # You'd think IndexError, but the failure comes from grabbing N table
    # entries at once: if x is out of range, the slice won't contain the
    # full three (in this case) entries, so the unpacking on the
    # e0, e1, e2 = ... line raises a ValueError.
    except ValueError: 
        return f(x)
def quartic(x,f,table,start,stop,d):
    try:
        i0 = int((x - start)//d) + 2
        x0 = start + d*(i0 -2)
        e0, e1, e2, e3 = table[i0-1:i0+3]
        return -((-6*e1*d**3 + d**2 * (2*e0 + 3*e1 - 6*e2 + e3)*(x - x0) 
                  - 3*d*(e0 - 2*e1 + e2)*(x - x0)**2 
                  + (e0 - 3*e1 + 3*e2 - e3)*(x - x0)**3)/(6*d**3))
    except ValueError:
        return f(x)
def quintic(x,f,table,start,stop,d):
    try:
        i0 = int((x - start)//d) + 2
        x0 = start + d*(i0 -2)
        e0, e1, e2, e3, e4 = table[i0-2:i0+3]
        return (1/(24*d**4))*( 24*d**4*e2 + 2*d**3*(e0 - 8*e1 + 8*e3 - e4)*(x - x0) - 
                d**2*(e0 - 16*e1 + 30*e2 - 16*e3 + e4)*(x - x0)**2 - 
                2*d*(e0 - 2*e1 + 2*e3 - e4)*(x - x0)**3 + 
                (e0 - 4*e1 + 6*e2 - 4*e3 + e4)*(x - x0)**4 )
    except ValueError:
        return f(x)
# and why not
def linear(x,f,table,start,stop,d):
    try:
        i0 = int((x - start)//d) + 2
        x0 = start + d*(i0 -2)
        e0, e1 = table[i0:i0+2]
        #m = (e1-e0)/d
        #b = e0 - m*x0
        #return m*x + b
        return e0 + (x-x0)*(e1-e0)/d
    except ValueError:
        return f(x)
    

def getinterp(order):
    return {2:linear,3:cubic,4:quartic,5:quintic}[order]
def getinterpname(order):
    return {2:'linear',3:'cubic',4:'quartic',5:'quintic'}[order]

So let's test, giving each one values inside the range as well as on both sides of it.

In [4]:
start,stop,d = 0,3,0.1
def f(x):
    return exp(-x)
table = gettable(start,stop,d,f)
def printapprox(order,val,f):
    interp = getinterp(order)
    exact,interpd = f(val),interp(val,f,table,start,stop,d)
    pdiff = abs(100 - 100*interpd/exact)
    print("exact {a} approx {b} %diff {c}".format(a=exact,b=interpd,c=pdiff))
def getf(name):
    def fexp(x):
        return exp(-x)
    def fsqr(x):
        return sqrt(x)
    def fsin(x):
        return sin(4*x)
    return {'exp':fexp,'sqrt':fsqr,'sin':fsin}[name]

print("Linear")
printapprox(2,.22,f)
printapprox(2,-.22,f)
printapprox(2,5.22,f)
print("Cubic")
printapprox(3,.22,f)
printapprox(3,-.22,f)
printapprox(3,5.22,f)
print("Quartic")
printapprox(4,.22,f)
printapprox(4,-.22,f)
printapprox(4,5.22,f)
print("Quintic")
printapprox(5,.22,f)
printapprox(5,-.22,f)
printapprox(5,5.22,f)
Linear
exact 0.802518797962 approx 0.803148246599 %diff 0.0784341298732
exact 1.24607673059 approx 1.24607673059 %diff 0.0
exact 0.00540732912644 approx 0.00540732912644 %diff 0.0
Cubic
exact 0.802518797962 approx 0.802492715994 %diff 0.00325001342684
exact 1.24607673059 approx 1.24607673059 %diff 0.0
exact 0.00540732912644 approx 0.00540732912644 %diff 0.0
Quartic
exact 0.802518797962 approx 0.802517668788 %diff 0.000140703834049
exact 1.24607673059 approx 1.24607673059 %diff 0.0
exact 0.00540732912644 approx 0.00540732912644 %diff 0.0
Quintic
exact 0.802518797962 approx 0.802518849726 %diff 6.45019096623e-06
exact 1.24607673059 approx 1.24607673059 %diff 0.0
exact 0.00540732912644 approx 0.00540732912644 %diff 0.0

That looks good. It's worth noting that this algorithm gives bad results in at least one clear case: for sqrt, the interpolations that try to use values for x < 0 will all give nan. One easy fix would be to test for a nan return value and just return the exact value. You'd have to consider how close your actual values are likely to be to zero, though, as that test will have to run for every lookup. There are probably smarter fixes. Here's a demonstration:

In [5]:
table = gettable(start=0,stop=3,d=0.1,f=getf('sqrt'))
cubic(x=0.01,f=getf('sqrt'),table=table,start=0,stop=3,d=0.1)
Out[5]:
nan
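One way to implement the easy fix mentioned above is to wrap an interpolator so that a nan result falls back to the exact function. This is just a sketch; `with_nan_fallback` and `bad_interp` are names I'm making up for illustration, and the toy "interpolator" stands in for a real table lookup:

```python
import math

def with_nan_fallback(interp, exact):
    """Wrap an interpolator so nan results fall back to the exact function."""
    def safe(x):
        y = interp(x)
        return exact(x) if math.isnan(y) else y
    return safe

# toy stand-in for a table lookup that goes bad near zero, like sqrt's does
def bad_interp(x):
    return float('nan') if x < 0.05 else math.sqrt(x)

safe = with_nan_fallback(bad_interp, math.sqrt)
print(safe(0.01))   # falls back to the exact math.sqrt(0.01)
```

Remember the caveat above: the `isnan` test runs on every lookup, so you pay for it even when the table behaves.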

Now let's look at this visually.

Note that I cap the y-axis of the bottom (% error) plot at 150%. When you're calculating numbers very close to zero (see sin for clear examples), the percentage error can get huge, and you might want to use other techniques (Taylor expansion, etc.).

In [6]:
from IPython.html import widgets 
from IPython.html.widgets import interact

figsize=(10,10)
def showinterp(d,order,fname='exp',rangeexp=0):
    fig = plt.figure(figsize=figsize)
    f = getf(fname)
    start,stop = 0,3
    start,stop,d = start*10**rangeexp,stop*10**rangeexp,d*10**rangeexp
    x = np.arange(start,stop,d/100.0)
    table = gettable(start,stop,d,f)
    exact = f(x)
    # Make these ufuncs for some huge speedup
    interp = getinterp(order)
    interpd = np.array([interp(i,f,table,start,stop,d) for i in x])
    #print(interpd[:10])
    err = interpd - exact
    errfrac = err/exact
    xerrfrac = x[~np.isnan(errfrac)] # for plotting later
    errfrac = errfrac[~np.isnan(errfrac)]
    errfrac = np.abs(errfrac)
    perr = np.abs(100*errfrac)
    plt.subplot(3,1,1)
    plt.plot(x,exact,'k',label='exact',lw=3)
    plt.plot(x,interpd,'r-',label='{n} interp d {d}'.format(n=getinterpname(order),d=d))
    plt.legend(fancybox=True)
    plt.subplot(3,1,2)
    plt.plot(x,err,'k-',label='error')
    plt.legend(fancybox=True)
    plt.subplot(3,1,3)
    plt.plot(xerrfrac,perr,'k-',label='% error')
    if plt.ylim()[1] > 151:
        plt.ylim(plt.ylim()[0],151)
    plt.legend(fancybox=True)
    #plt.grid(True)
    m = np.max(err[~np.isnan(err)])
    a = np.average(err[~np.isnan(err)])
    mp = np.max(perr)
    ap = np.average(perr)
    plt.xlabel('{order} {d} {start} {stop} Max err {m:g}[{mp:g}%] avg {a:g}[{ap:g}%]'.format(**locals()))
    return fig
interact(showinterp,d=widgets.FloatSliderWidget(min=.001,max=.5,step=.001,value=.1),
         order=(2,5),
         fname=['exp','sqrt','sin'],
         rangeexp=widgets.IntSliderWidget(min=-20,max=0,step=20,value=0)
         )

That's awesome in a live notebook. Let's do something similar that we can look at statically. There are two main choices here: JSAnimation and ipywidgets, both by the indomitable Jake VanderPlas. I really like the ability to play things as a movie, but ipywidgets wins for now because I can do dropdowns. You can animate it by dragging the sliders. Hey, it's like a flipbook! rangeexp will let you play with the range, so you can see what happens when you look over small divisions between floating point numbers.

In [7]:
from ipywidgets import StaticInteract, RangeWidget, RadioWidget, DropDownWidget
figsize=(6,6)

StaticInteract(showinterp,d=RadioWidget([0.001, 0.0025, 0.005, 0.0075, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5],default=0.1),
               order=RangeWidget(2,5,1,default=3),
               fname=DropDownWidget(['exp','sqrt','sin'],default='exp'),
               rangeexp=RadioWidget([-20,0],default=0),
               )
/home/mglerner/anaconda/lib/python2.7/site-packages/matplotlib/pyplot.py:424: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)

Out[7]:

Speed Testing

The obvious next thing to do here is to compare the speed of explicit calculations vs. table lookups. I think Python is likely a bad language for that; if you're doing this, you'll likely be doing it in C or FORTRAN, and you should just test the code explicitly there. Speed numbers from Python will almost surely be contaminated by Python-specific overhead. If you're using Python for time-critical code, you should probably check out The IPython Cookbook or Python High Performance Programming.
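That said, if you just want a rough feel from within Python anyway, a timeit sketch like this works. Here `lookup` is a hypothetical minimal linear-table function, not the code above; on CPython the pure-Python table typically loses to math.exp, which is exactly the overhead problem just mentioned:

```python
import math, timeit

start, stop, d = 0.0, 3.0, 0.001
# precompute exp(-x) on a grid; one extra point so lookups near `stop` work
table = [math.exp(-(start + i*d)) for i in range(int((stop - start)/d) + 2)]

def lookup(x):
    # linear interpolation into the table (no range checking, for brevity)
    i = int((x - start)/d)
    x0 = start + i*d
    return table[i] + (x - x0)*(table[i+1] - table[i])/d

# sanity check: with d=0.001 this table is accurate to well under 1e-6 here
assert abs(lookup(1.234) - math.exp(-1.234)) < 1e-6

n = 100000
t_exact = timeit.timeit("math.exp(-1.234)", globals=globals(), number=n)
t_table = timeit.timeit("lookup(1.234)", globals=globals(), number=n)
print(t_exact, t_table)  # interpreter overhead usually dominates the lookup
```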

In [1]:
%%html
    <div id="disqus_thread"></div>
    <script type="text/javascript">
        /* * * CONFIGURATION VARIABLES: EDIT BEFORE PASTING INTO YOUR WEBPAGE * * */
        var disqus_shortname = 'mathematicalphysics'; // required: replace example with your forum shortname

        /* * * DON'T EDIT BELOW THIS LINE * * */
        (function() {
            var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
            dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
            (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
        })();
    </script>
    <noscript>Please enable JavaScript to view the <a href="http://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
(do the above for comments via the nbviewer)

Making IPython Notebooks for the matplotlib examples


matplotlib comes with tons of fantastic examples. I’m not as familiar with matplotlib as I probably should be, so I often find myself wanting to tinker a bit, but needing to refer to those examples. Since matplotlib comes with such wonderful documentation, I thought it would be great to just turn those docs into IPython Notebooks for easy tinkering. That’s probably biting off a bit more than I want to chew at the moment, considering that the matplotlib docs are fairly involved and written in reStructuredText instead of markdown (what the IPython Notebook uses).

Luckily, the IPython Notebook format is so mind-bendingly sane that I didn’t even need to read any documentation to understand it. So, instead, I wrote a bit of code that gobbles up matplotlib example scripts and spits out IPython Notebooks. The whole notebook is JSON, but I only want simple things, so I hardcode everything except for the cells. (After Daniel’s comment below, I started to write my own JSONEncoder. Then, I realized that I was right about the “it’s all JSON” thing and rewrote the notebook class). I have a little IPyNB class that knows how to add cells to itself and spit out the results as strings and files:

import os, json, glob

class IPyNB():
    def __init__(self,name):
        self.name = name
        self.cells = []
        self.d = {'metadata':{'name':''},
                  'nbformat':3,
                  'nbformat_minor':0,
                  'worksheets':[{'cells':[],'metadata':{}}]
                  }

    def addcell(self,celltype,content):
        cell = {'cell_type':celltype,
                'metadata':{}
                }
        if celltype == 'code':
            cell['collapsed'] = False
            cell['input'] = content.split('\n')
            cell['language'] = 'python'
            cell['outputs'] = []
        elif celltype == 'markdown':
            cell['source'] = content.split('\n')
        #elif celltype == 'raw': ...
        else:
            raise NotImplementedError('Unknown cell type: {celltype}'.format(celltype=celltype))
        self.d['worksheets'][0]['cells'].append(cell)

    def tostring(self):
        return json.dumps(self.d,
                          sort_keys=True,
                          indent=1, separators=(',', ': '),
            )

    def tofile(self,outdir='.',overwrite=False):
        fname = os.path.join(outdir,self.name+'.ipynb')
        if os.path.isfile(fname) and not overwrite:
            raise IOError("File {fname} already exists".format(fname=fname))
        f = open(fname,'w')
        f.write(self.tostring())
        f.close()

I’ve defined a few cell types. It’s easy to add more if needed.

And then a couple of functions to read in the matplotlib examples and spit out notebooks:

def make_mpl_examples(subdir='images_contours_and_fields',basedir='~/coding/matplotlib/examples',outdir='.',overwrite=False):
    n = IPyNB(subdir + ' Examples')
    n.addcell('markdown',"""#matplotlib examples
The below examples are taken directly from the matplotlib example directory {subdir} and rendered in an IPython Notebook. Before you run them, you should execute one of the first two cells, depending on whether you want inline rendering or not. The inline rendering is usually much nicer for interactive work, but doesn't support all features (e.g. animation).""".format(subdir=subdir))
    n.addcell('code','%matplotlib inline')
    n.addcell('code','%matplotlib')
    pat = os.path.join(os.path.expanduser(basedir),subdir,'*.py')
    examples = glob.glob(pat)
    if not examples:
        raise IOError('No files found: {pat}'.format(pat=pat))
    for ex in examples:
        n.addcell('markdown','## {exname}'.format(exname=ex))
        n.addcell('code',open(ex).read())
    n.tofile(overwrite=overwrite,outdir=outdir)

def make_all_mpl_examples(outdir='output',basedir='~/coding/matplotlib/examples',overwrite=False):
    # There must be a builtin!
    bd = os.path.expanduser(basedir)
    subdirs = [d for d in os.listdir(bd) if os.path.isdir(os.path.join(bd,d))]
    for s in subdirs:
        print "Doing",s
        make_mpl_examples(subdir=s,basedir=basedir,outdir=outdir,overwrite=overwrite)

It spits out one notebook per example directory, but is easy to change. It does what I want very nicely: gives me an immediate way to tweak the matplotlib examples. I’ll leave the bigger project (turning all of the matplotlib docs into IPython Notebooks) for later, if ever. The first thing I’d need to figure out is how to execute the code in the cells programmatically. I’m sure there’s an API.

Here’s an example of some examples. I’ve executed several of the cells so that there’s output. This is significantly less interesting on the web, and more interesting if you run it in a notebook on your own so that you can tweak things.

matplotlib examples

The below examples are taken directly from the matplotlib example directory images_contours_and_fields and rendered in an IPython Notebook. Before you run them, you should execute one of the first two cells, depending on whether you want inline rendering or not. The inline rendering is usually much nicer for interactive work, but doesn't support all features (e.g. animation).

In [1]:
%matplotlib inline
In []:
%matplotlib

/Users/mglerner/coding/matplotlib/examples/images_contours_and_fields/image_demo.py

In [2]:
"""
Simple demo of the imshow function.
"""
import matplotlib.pyplot as plt
import matplotlib.cbook as cbook

image_file = cbook.get_sample_data('ada.png')
image = plt.imread(image_file)

plt.imshow(image)
plt.axis('off') # clear x- and y-axes
plt.show()

/Users/mglerner/coding/matplotlib/examples/images_contours_and_fields/image_demo_clip_path.py

In [3]:
"""
Demo of image that's been clipped by a circular patch.
"""
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.cbook as cbook


image_file = cbook.get_sample_data('grace_hopper.png')
image = plt.imread(image_file)

fig, ax = plt.subplots()
im = ax.imshow(image)
patch = patches.Circle((260, 200), radius=200, transform=ax.transData)
im.set_clip_path(patch)

plt.axis('off')
plt.show()

/Users/mglerner/coding/matplotlib/examples/images_contours_and_fields/pcolormesh_levels.py

In [4]:
"""
Shows how to combine Normalization and Colormap instances to draw
"levels" in pcolor, pcolormesh and imshow type plots in a similar
way to the levels keyword argument to contour/contourf.

"""

import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm
from matplotlib.ticker import MaxNLocator
import numpy as np


# make these smaller to increase the resolution
dx, dy = 0.05, 0.05

# generate 2 2d grids for the x & y bounds
y, x = np.mgrid[slice(1, 5 + dy, dy),
                slice(1, 5 + dx, dx)]

z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

# x and y are bounds, so z should be the value *inside* those bounds.
# Therefore, remove the last value from the z array.
z = z[:-1, :-1]
levels = MaxNLocator(nbins=15).tick_values(z.min(), z.max())


# pick the desired colormap, sensible levels, and define a normalization
# instance which takes data values and translates those into levels.
cmap = plt.get_cmap('PiYG')
norm = BoundaryNorm(levels, ncolors=cmap.N, clip=True)

plt.subplot(2, 1, 1)
im = plt.pcolormesh(x, y, z, cmap=cmap, norm=norm)
plt.colorbar()
# set the limits of the plot to the limits of the data
plt.axis([x.min(), x.max(), y.min(), y.max()])
plt.title('pcolormesh with levels')



plt.subplot(2, 1, 2)
# contours are *point* based plots, so convert our bound into point
# centers
plt.contourf(x[:-1, :-1] + dx / 2.,
             y[:-1, :-1] + dy / 2., z, levels=levels,
             cmap=cmap)
plt.colorbar()
plt.title('contourf with levels')


plt.show()

/Users/mglerner/coding/matplotlib/examples/images_contours_and_fields/streamplot_demo_features.py

In [5]:
"""
Demo of the `streamplot` function.

A streamplot, or streamline plot, is used to display 2D vector fields. This
example shows a few features of the stream plot function:

    * Varying the color along a streamline.
    * Varying the density of streamlines.
    * Varying the line width along a stream line.
"""
import numpy as np
import matplotlib.pyplot as plt

Y, X = np.mgrid[-3:3:100j, -3:3:100j]
U = -1 - X**2 + Y
V = 1 + X - Y**2
speed = np.sqrt(U*U + V*V)

plt.streamplot(X, Y, U, V, color=U, linewidth=2, cmap=plt.cm.autumn)
plt.colorbar()

f, (ax1, ax2) = plt.subplots(ncols=2)
ax1.streamplot(X, Y, U, V, density=[0.5, 1])

lw = 5*speed/speed.max()
ax2.streamplot(X, Y, U, V, density=0.6, color='k', linewidth=lw)

plt.show()

/Users/mglerner/coding/matplotlib/examples/images_contours_and_fields/streamplot_demo_masking.py

In [6]:
"""
Demo of the streamplot function with masking.

This example shows how streamlines created by the streamplot function skips
masked regions and NaN values.
"""
import numpy as np
import matplotlib.pyplot as plt

w = 3
Y, X = np.mgrid[-w:w:100j, -w:w:100j]
U = -1 - X**2 + Y
V = 1 + X - Y**2
speed = np.sqrt(U*U + V*V)

mask = np.zeros(U.shape, dtype=bool)
mask[40:60, 40:60] = 1
U = np.ma.array(U, mask=mask)
U[:20, :20] = np.nan

plt.streamplot(X, Y, U, V, color='r')

plt.imshow(~mask, extent=(-w, w, -w, w), alpha=0.5,
           interpolation='nearest', cmap=plt.cm.gray)

plt.show()

Comments from old blog

2 Responses to Making IPython Notebooks for the matplotlib examples

  1. Daniel says:

    Your IPyNB class could be made a lot more readable by using a custom JSON encoder, e.g. extending JSONEncoder http://docs.python.org/2/library/json.html#json.JSONEncoder. That way, you could store the preamble/afterward as a Python dict and just call json.dumps(..., default=MyJSONEncoder) inside tostring().


Histograms and Kernel Density Estimation (KDE)


(Note: you can download the IPython Notebook here.)

Why histograms

As we all know, histograms are an extremely common way to make sense of discrete data. Whether we mean to or not, when we're using histograms, we're usually doing some form of density estimation. That is, although we only have a few discrete data points, we pretend that they're drawn from some sort of continuous distribution, and we'd really like to know what that distribution is. For instance, I was recently grading an exam and trying to figure out what the underlying distribution of grades looked like, whether I should curve the exam, and, if so, how I should curve it.

I'll poke at this in an IPython Notebook; if you're doing this in a different environment, you may wish to uncomment the commented lines so that your namespace is properly polluted.

In [1]:
from __future__ import division
%pylab inline

grades = array((93.5,93,60.8,94.5,82,87.5,91.5,99.5,86,93.5,92.5,78,76,69,94.5,89.5,92.8,78,65.5,98,98.5,92.3,95.5,76,91,95,61.4,96,90))
junk = hist(grades)
Populating the interactive namespace from numpy and matplotlib

Why not histograms?

We can play around with the number of bins, but it's not totally clear what's going on with the left half of the grades.

In [2]:
junk = hist(grades,5)
In [3]:
junk = hist(grades,15)

So, maybe the histogram isn't the perfect tool for the job at hand. In fact, there are quite a few well-known problems with histograms. Shodor has a really nice histogram activity that lets you play around with data interactively. Rather than using Java or JavaScript directly, Jake VanderPlas has a great package called JSAnimation that lets us animate things directly in IPython Notebooks. I'll cheat a bit: since all I really need for this is a single slider, I can use JSAnimation to let us interact with data very similarly to the Shodor pages.

In [4]:
from JSAnimation.IPython_display import display_animation, anim_to_html

Before we start, I'll load in a few data sets. If you're interested, you can rerun this notebook with a different data set to see how it affects things. data_shodor is the "My Data" set from their histogram activity page, data_sat is the average SAT Math data from the same page, data_tarn is from Tarn Duong's fantastic KDE explanation (we'll get there), and simple_data is just a very simple data set.

In [5]:
data_tarn = array((2.1,2.2,2.3,2.25,2.4,2.61,2.62,3.3,3.4,3.41,3.6,3.8))
data_shodor = array((49,49,45,45,41,38,38,38,40,37,37,34,35,36,35,38,38,32,32,32,37,31,32,31,32,30,30,32,30,30,29,28,29,29,29,30,28,27,29,30,28,27,28,27,27,29,29,29,26,27,25,25,25,25,25,25,25,26,26,27))
data_sat = array((490,499,459,575,575,513,382,525,510,542,368,564,509,530,485,521,495,526,474,500,441,750,495,476,456,440,547,479,501,476,457,444,444,467,482,449,464,501,670,740,590,700,590,450,452,468,472,447,520,506,570,474,532,472,585,466,549,736,654,585,574,621,542,616,547,554,514,592,531,550,507,441,551,450,548,589,549,485,480,545,451,448,487,480,540,470,529,445,460,457,560,495,480,430,644,489,506,660,444,551,583,457,440,470,486,413,470,408,440,596,442,544,528,559,505,450,477,557,446,553,370,533,496,513,403,496,543,533,471,404,439,459,393,470,650,512,445,446,474,449,529,538,450,570,499,375,515,445,571,442,492,456,428,536,515,450,537,490,446,410,526,560,560,540,502,642,590,480,557,468,524,445,479))
simple_data = array((0,5,10))
data = grades

Two of the main problems with histograms are (1) you need to define a bin size (2) you need to decide where the left edge of the bin is.
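The second problem is easy to show statically with Tarn's data set from above; np.histogram stands in for the plots, since the counts are what matter. Same bin width, two different left edges, noticeably different stories:

```python
import numpy as np

data_tarn = np.array([2.1, 2.2, 2.3, 2.25, 2.4, 2.61,
                      2.62, 3.3, 3.4, 3.41, 3.6, 3.8])

# same bin width (0.5), but the second set of edges is shifted by half a bin
counts_a, _ = np.histogram(data_tarn, bins=np.arange(1.5, 4.5, 0.5))
counts_b, _ = np.histogram(data_tarn, bins=np.arange(1.75, 4.75, 0.5))
print(counts_a)   # [0 5 2 3 2]
print(counts_b)   # [2 5 0 4 1] -- same data, different apparent shape
```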

Histogram bin size

Let's look at the effects of bin size on histograms.

Caveat: the code below is certainly not optimized. Ditto for all of the code in this notebook. I wrote it quickly and at the same time I learned what FuncAnimation does. In order to make this read more easily, I've included most of the code at the end. If you're running this interactively, run the cell at the end now!

Let's start with getHistBinNumAni. What does that do? Given a data set, it'll give us an interactive plot. By dragging the slider around, we can make a histogram with anywhere from 1 bin to some max (default: 20) number of bins. No matter how many bins we have, the actual data is shown in blue dots near the bottom. Here's what it looks like for the grades:

In [7]:
ani = getHistBinNumAni(data)
display_animation(ani, default_mode='once')
Out[7]:



So, obviously, choosing the number of bins makes a huge difference in how we'd interpret the data.

Where do the histogram bins start?

One of the other big problems with histograms, especially for relatively small data sets, is that you have to choose where the left edge of the first bin goes. Do you center the bin around the first group of points? Do you make the left edge match up with the left-most data point? Let's make some plots to see how that can affect things, because it's a bit easier to understand what I'm going on about that way. We'll make a similar animation with getHistBinOffsetAni. As with the previous animation, drag the slider around. This time, we have the same number of bins, but the slider drags around the data relative to the bins (or vice versa, depending on how you think of it).

In [8]:
ani = getHistBinOffsetAni(data)
display_animation(ani, default_mode='once')
Out[8]:



KDE (Kernel Density Estimation) to the rescue!

Kernel density estimation is my favorite alternative to histograms. Tarn Duong has a fantastic KDE explanation, which is well worth reading. The basic idea is that, if you're looking at our simple dataset (simple_data = array((0,5,10))), you might choose to represent each point as a rectangle:

In [9]:
bar(simple_data,ones_like(simple_data)*0.5,width=0.5,facecolor='grey',alpha=0.5)
junk = ylim(0, 2.0)

Not so interesting so far. But what do we do when the rectangles get wide enough that they start to overlap? Instead of just letting them run over each other like

In [10]:
bar(simple_data,ones_like(simple_data)*0.5,width=6,facecolor='grey',alpha=0.5)
junk = ylim(0, 2.0)

and instead of coloring the overlap regions darker grey, we add the rectangles together. So, since each of the rectangles has height 0.5 in the above example, the dark grey regions should really have height 1.0. This idea is called "kernel density estimation" (KDE), and the rectangle that we're using is called the "kernel". If we wanted to draw a different shape at each point, we'd do so by specifying a different kernel (perhaps a bell curve, or a triangle).
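Numerically, that "add the rectangles together" idea is just a sum of indicator functions. Here's a sketch; unlike the bar plots above, I center each rectangle on its data point, which is the usual KDE convention:

```python
import numpy as np

simple_data = np.array([0.0, 5.0, 10.0])
width, height = 6.0, 0.5
x = np.linspace(-5, 15, 401)

# each point contributes a rectangle of the given width centered on it;
# overlapping rectangles add instead of just painting over each other
density = np.zeros_like(x)
for pt in simple_data:
    density += height*(np.abs(x - pt) <= width/2)

print(density.max())   # 1.0 where two rectangles overlap
```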

KDE, rectangular kernel

Now let's try KDE with a rectangular kernel. This time, using getKdeRectAni, the slider controls the width of the kernel.

In [11]:
ani = getKdeRectAni(simple_data)
display_animation(ani, default_mode='once')
Out[11]:



Play with the slider, and note what happens when you make it big enough that the rectangles start to overlap. By tuning the width of the rectangles, we can tune how finely or coarsely we're looking at the data. It's not so powerful with three data points, but check it out with the grades from above:

In [12]:
ani = getKdeRectAni(data)
display_animation(ani, default_mode='once')

In my view, there's a sweet spot right around 1/8 or 1/9 of the way across the slider where there are three distinct peaks. It looks very much like a trimodal distribution to me. So far, this isn't totally automatic; we have to pick the width of our kernel, but it's obvious that KDE can give us a much better view of the underlying data than histograms!

KDE, Gaussian kernel

As mentioned above, we can use a different kernel. One of the most common choices is a Gaussian. Using getKdeGaussianAni (again, the slider controls the kernel width):

In [13]:
ani = getKdeGaussianAni(data)
display_animation(ani, default_mode='once')

This gives us a really nice picture of the data. Play around with the slider and see what you think.

Kernel width

Not only does KDE give us a better picture than histograms, but there turn out to be actual answers to the question of "how wide should my kernel be?" You can see, for instance, that making the kernel too narrow doesn't provide much more information than the raw data, while making it too large oversmooths the data, making it mostly look like a single kernel with some bits on the sides.

Daniel Smith has a really nice KDE module that chooses an optimal bandwidth and can be used with SciPy (scipy does have its own KDE module, but I've found Daniel's to be quite robust).
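(SciPy's built-in version is scipy.stats.gaussian_kde, and its bw_method argument makes the bandwidth question concrete. A minimal sketch with made-up trimodal "grades"; by default SciPy picks the bandwidth via Scott's rule.)

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Made-up trimodal "grades" for illustration.
grades = np.concatenate([rng.normal(65, 3, 30),
                         rng.normal(80, 3, 30),
                         rng.normal(92, 3, 30)])

x = np.linspace(50, 105, 500)
density_auto = gaussian_kde(grades)(x)                 # Scott's-rule bandwidth
density_wide = gaussian_kde(grades, bw_method=2.0)(x)  # deliberately oversmoothed

# The automatic bandwidth keeps the peaks; oversmoothing flattens them.
print(density_auto.max() > density_wide.max())  # True
```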

Other data sets

I highly recommend just playing around with other data sets using the above code. I was interested in playing around with income data, so I show how to grab that data from the IRS website below and play around a bit without comment. Enjoy!

Income data

Let's grab the income data from The IRS and make some plots.

In [14]:
import urllib
f = urllib.urlopen("http://www.irs.gov/file_source/pub/irs-soi/09incicsv.csv")
#"State_Code","County_Code","State_Abbrv","County_Name",   "Return_Num","Exmpt_Num","AGI","Wages_Salaries","Dividends","Interest"
irs2009 = loadtxt(f,delimiter=',',skiprows=1,usecols=(4,5,6,7,8,9))
agi2009 = irs2009[:,2]

Now try things like

In [15]:
ani = getHistBinNumAni(agi2009)
display_animation(ani, default_mode='once')

Whoops, that's hard to make sense of. Let's use logs:

In [16]:
la2009 = log(agi2009)
la2009 = la2009[~isnan(la2009)]
-c:1: RuntimeWarning: invalid value encountered in log
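Filtering out the bad values generalizes nicely. A minimal sketch with made-up numbers (using isfinite, which drops both the nan from the log of a negative number and the -inf from log(0)):

```python
import numpy as np

# Made-up AGI-like values; zero and negative entries break log().
agi = np.array([1200.0, 0.0, 53000.0, -40.0, 880000.0])

with np.errstate(invalid='ignore', divide='ignore'):
    la = np.log(agi)  # produces -inf for 0.0 and nan for -40.0

la = la[np.isfinite(la)]  # keep only the well-defined logs
print(la.size)  # 3
```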

In [17]:
ani = getHistBinNumAni(la2009)
display_animation(ani, default_mode='once')
In [18]:
ani = getKdeRectAni(la2009)
display_animation(ani, default_mode='once')
In [19]:
ani = getKdeGaussianAni(la2009)
display_animation(ani,default_mode='once')

In order to make this read more easily, I've put the bulk of the code below. You'll have to run it before the previous cells.

In [6]:
#!/usr/bin/env python

from numpy import histogram as nphistogram
#from numpy import array, linspace, zeros, ones, ones_like
#import numpy as np
#import matplotlib.pyplot as plt
#from matplotlib.pyplot import figure, hist, plot, ion, axes, title
from JSAnimation.IPython_display import display_animation, anim_to_html

from matplotlib import animation as animation


def getHistBinNumAni(data,totalframes=None,showpts=True):
    #ion()
    if totalframes is None:
        totalframes = min(len(data)-1,100)
    fig = figure()
    ax = fig.gca()

    n, bins, patches = hist(data, totalframes, normed=1, facecolor='green', alpha=0.0)
    if showpts:
        junk = plot(data,0.2*ones_like(data),'bo')
    def animate(i):
        n, bins = nphistogram(data, i+1, normed=False)
        #print n
        ax.set_ylim(0,1.1*n.max())
        for j in range(len(n)):
            rect,h = patches[j],n[j]
            #print h.max()
            x = bins[j]
            w = bins[j+1] - x
            rect.set_height(h)
            rect.set_x(x)
            rect.set_width(w)
            rect.set_alpha(0.75)
        #fig.canvas.draw()
    
    ani = animation.FuncAnimation(fig, animate, totalframes, repeat=False)
    return ani

def getHistBinOffsetAni(data,nbins=20,showpts=True):
    offsets = linspace(-0.5,0.5,50)
    totalframes = len(offsets)
    fig = figure()
    ax = fig.gca()

    n, _bins, patches = hist(data, nbins, normed=1, facecolor='green', alpha=0.0)
    if showpts:
        junk = plot(data,0.2*ones_like(data),'bo')
    # Obnoxious: find max number in a bin ever
    nmax = 1
    for i in range(totalframes):
        dx = (data.max() - data.min())/nbins
        _bins = linspace(data.min() - dx + offsets[i]*dx, data.max() + dx + offsets[i]*dx, nbins + 1)
        n, bins = nphistogram(data, bins=_bins, normed=False)
        nmax = max(nmax,n.max())
                               
    def animate(i):
        dx = (data.max() - data.min())/nbins
        # bins go from min - dx to max + dx, then offset.
        _bins = linspace(data.min() - dx + offsets[i]*dx, data.max() + dx + offsets[i]*dx, nbins + 1)
        n, bins = nphistogram(data, bins = _bins, normed=False)
        ax.set_ylim(0,1.1*nmax)
        #ax.set_xlim(data.min()-dx,data.max()+dx)
        binwidth = bins[1] - bins[0]
        ax.set_xlim(bins[0]-binwidth,bins[-1] + binwidth)

        for j in range(len(n)):
            #continue
            rect,h = patches[j],n[j]
            #print h.max()
            x = bins[j]
            w = bins[j+1] - x
            rect.set_height(h)
            rect.set_x(x)
            rect.set_width(w)
            rect.set_alpha(0.75)
        fig.canvas.draw()    
    ani = animation.FuncAnimation(fig, animate, totalframes, repeat=False)
    return ani
#!/usr/bin/env python

from numpy import sqrt, pi, exp

def getKdeGaussianAni(data,totalframes=100, showpts=True):
    fig = figure()
    
    # Let's say 10000 points for the whole thing
    width = data.max() - data.min()
    left, right = data.min(), data.min() + (width)
    left, right = left - (totalframes/100)*width, right + (totalframes/100)*width
    
    ax = axes(xlim=(left,right),ylim=(-0.1,2))
    line, = ax.plot([], [], lw=2)
    if showpts:
        junk = plot(data,ones_like(data)*0.1,'go')

    
    numpts = 10000
    x = linspace(left,right,numpts)
    
    dx = (right-left)/(numpts-1)
    
    def init():
        line.set_data([], [])
        return line,
    
    def gaussian(x,sigma,mu):
        # (scipy.stats.norm.pdf computes this, but it's easy enough by hand.)
        return (1/sqrt(2*pi*sigma**2)) *  exp(-((x-mu)**2)/(2*sigma**2))
    
    def animate(i):
        y = zeros(10000)
        kernelwidth = .02*width*(i+1)
        kernelpts = int(kernelwidth/dx)
        kernel = gaussian(linspace(-3,3,kernelpts),1,0)
        #kernel = ones(kernelpts)
        for d in data:
            center = d - left
            centerpts = int(center/dx)
            bottom = centerpts - int(kernelpts/2)
            top = centerpts+int(kernelpts/2)
            if top - bottom < kernelpts: top = top + 1
            if top - bottom > kernelpts: top = top - 1
            y[bottom:top] += kernel
        ax.set_xlim(x[where(y>0)[0][0]],x[where(y>0)[0][-1]])
        line.set_data(x,y)
        ax.set_ylim(min(0,y.min()),1.1*y.max())
        #title('ymin %s ymax %s'%(y.min(),y.max()))

    
        #sleep(0.1)
        return line,
    ani = animation.FuncAnimation(fig, animate, init_func=init,
                                  frames=totalframes, repeat=False)
    return ani
#FACTOR ME for rect and gaussian
def getKdeRectAni(data,totalframes=100,showpts=True):
    #ion()
    fig = figure()
    
    # Let's say 10000 points for the whole thing
    width = data.max() - data.min()
    left, right = data.min(), data.min() + (width)
    left, right = left - (totalframes/100)*width, right + (totalframes/100)*width
    
    ax = axes(xlim=(left,right),ylim=(-0.1,2))
    line, = ax.plot([], [], lw=2)
    
    numpts = 10000
    x = linspace(left,right,numpts)
    
    dx = (right-left)/(numpts-1)
    
    def init():
        line.set_data([], [])
        return line,

    if showpts:
        junk = plot(data,0.2*ones_like(data),'bo')
    
    def animate(i):
        y = zeros(10000)
        kernelwidth = .02*width*(i+1)
        kernelpts = int(kernelwidth/dx)
        kernel = ones(kernelpts)
        for d in data:
            center = d - left
            centerpts = int(center/dx)
            bottom = centerpts - int(kernelpts/2)
            top = centerpts+int(kernelpts/2)
            if top - bottom < kernelpts: top = top + 1
            if top - bottom > kernelpts: top = top - 1
            y[bottom:top] += kernel
        line.set_data(x,y)
        ax.set_ylim(0,1.1*y.max())
        ax.set_xlim(x[where(y>0)[0][0]],x[where(y>0)[0][-1]])
    
        #sleep(0.1)
        return line,
    ani = animation.FuncAnimation(fig, animate, init_func=init,
                                  frames=totalframes, repeat=False)
    return ani

And that's it. Cheers!

Comments from old blog

13 Responses to Histograms and Kernel Density Estimation (KDE)

  1. Arindam Paul says:

    Excellent !! One of the best explanations of KDE I have ever seen.
    This post has generated enough interest to read your other blogs. great job.

  2. Nils Wagner says:

    Assume that we have a spatial energy distribution given at discrete points in 3-D, i.e.

    E_i(x_i,y_i,z_i)

    where E_i denotes the energy and x_i,y_i,z_i are the corresponding coordinates.

    Is it possible to extract the local hot spots using scipy ?

    A small example is appreciated.

    Thanks in advance.

  3. domain says:

    It's really a cool and helpful piece of info. I'm happy that you simply shared this helpful information with us.
    Please stay us up to date like this. Thanks for sharing.

  4. Andreas says:

    Thanks for sharing your knowledge and interpretation of kernel density estimation with us. Very enlighting.

  5. gmas says:

    If I try to run your notebook, I get this name error:


    NameError Traceback (most recent call last)
    in ()
    ----> 1 ani = getHistBinNumAni(data)
    2 display_animation(ani, default_mode='once')

    NameError: name 'getHistBinNumAni' is not defined

  6. X says:

    Is there a way to fit data to an exponential distribution such that it maximizes the entropy H(p_i) = - sum p_i*log(p_i) where p_i is the probability of a given event?

  7. Pingback: Histogram to PDF

  8. Pingback: Histogram to PDF

  9. Ben says:

    Fantastic explanation!
    Best KDE description I've found so far!
    Keep up the good work!

  10. Just wanted to say this website is extremely good. I always want to hear new things about this because I’ve the similar blog during my Country with this subject which means this help´s me a lot. I did so a search around the issue and located a large amount of blogs but nothing beats this. Many thanks for sharing so much inside your blog..


Drum head normal modes, with movies

We’re working our way through Boas in my Mathematical Physics class, and we’ve come to the point in the PDE chapter where every good Physics student figures out what the normal modes of a circular drum head ought to look like. Punchline:

Drum heads

So, we get to the intuitive answer that we’re looking for $$J_n(k_{mn}r)\cos(n\theta)\cos(k_{mn}vt)$$, where $$k_{mn}$$ is the $$m^{th}$$ zero of $$J_n$$ (we could also look for the $$\sin \sin$$, $$\sin \cos$$ or $$\cos \sin$$ solutions, but they’re not fundamentally different), and Boas has the standard picture of the nodelines to help with visualization. I’m becoming increasingly convinced that students should be competent with at least a little bit of Python. You could argue for Matlab or Mathematica, but I’m leaning heavily towards Python for a number of reasons[*]. The point is that things like “the sum of two specific traveling waves is a standing wave” are just easier to see if you can visualize and animate things, and they pack a bigger punch if you can do that yourself.

matplotlib animation interlude

I used standard matplotlib animations to show some basics with traveling waves, standing waves, Bessel functions, and building up series solutions piece by piece. The animation framework is excellent, because it’s simple to use and a traveling wave looks like this:

#!/usr/bin/env python
from __future__ import division
from numpy import sin, pi, linspace
import matplotlib.pyplot as plt
from matplotlib import animation

# First set up the figure, the axis, and the plot element we want to animate
v = -0.01
fig = plt.figure()
ax = plt.axes(xlim=(-4, 4), ylim=(-2, 2))
line, = ax.plot([], [], lw=2, color='red')

# initialization function: plot the background of each frame
def init():
    line.set_data([], [])
    return line,

# animation function.  This is called sequentially
x = linspace(-4, 4, 2000)

def animate(t):
    y = sin(2 * pi * (x - v * t))
    line.set_data(x, y)
    return line,

# call the animator.  blit=True means only re-draw the parts that have changed.
anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=100, interval=20, blit=True)

plt.show()

I can talk the students through it, and there are really only two interesting lines: the v = -0.01 to set the velocity and the y = sin(2 * pi * (x - v * t)) to do the math. Students can change the sign of the velocity to see it move in the other direction, change the math line to have some series solution, etc. in a straightforward way. We did that with our solutions for plucked and hammered strings (guitar vs. piano). It made for a nice in-class demo, and I’ll have the students more involved from the start of the semester next time. There’s code for simple waves in my github repo, and you can find lots of decent animations on the net.
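That "sum of two traveling waves is a standing wave" fact is itself a nice thing to check numerically, since sin(x - vt) + sin(x + vt) = 2 sin(x) cos(vt) pointwise:

```python
import numpy as np

x = np.linspace(-4, 4, 200)
v = 0.3
for t in (0.0, 1.7, 5.0):
    left = np.sin(2 * np.pi * (x - v * t))   # traveling right
    right = np.sin(2 * np.pi * (x + v * t))  # traveling left
    standing = 2 * np.sin(2 * np.pi * x) * np.cos(2 * np.pi * v * t)
    # sin(A - B) + sin(A + B) = 2 sin(A) cos(B)
    assert np.allclose(left + right, standing)
```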

back to drum heads

What you can’t seem to find (or, at least, I can’t. Edit: I knew the web had to have a nice demo, and here’s one using Mathematica) is animated versions of the normal modes we want ($$J_n(k_{mn}r)\cos(n\theta)\cos(k_{mn}vt)$$). The math part is easy, especially because scipy.special has such nice things built in. The main part just involves

from scipy.special import jn, jn_zeros
from numpy import cos

def k(m,n):
    return jn_zeros(n,m)[m-1] # m is 1-indexed here

result = jn(n,k(m,n)*R)*cos(n*theta)*cos(k(m,n)*v*t)
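One quick sanity check on k(m,n) (a self-contained version; jn and jn_zeros are the real scipy.special functions): it is defined precisely so that the displacement J_n(k_{mn} r) vanishes at the clamped rim r = 1.

```python
import numpy as np
from scipy.special import jn, jn_zeros

def k(m, n):
    return jn_zeros(n, m)[m - 1]  # the m-th zero of J_n (m counted from 1)

# The mode shape J_n(k r) must vanish at the rim r = 1 for every (m, n).
for n in range(3):
    for m in (1, 2, 3):
        assert abs(jn(n, k(m, n))) < 1e-10

print(round(k(1, 0), 4))  # first zero of J_0, approximately 2.4048
```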

The animation part is a bit trickier because I really want 3D plots in addition to the 2D ones, and the matplotlib 3d canvas doesn’t play nicely with the animation framework. It’s still not particularly tricky. When we plot something, we get an object back. We usually ignore that object. In order to “animate” things, we just keep track of that object and delete it each time through our main loop. I ended up with two versions of the code. The one below has a couple of nice features. It animates a single mode, but you can specify that on the command line. Optionally, you can also specify “cc” for “cos cos”, “cs” for “cos sin” etc.

The funny business with zlim near the end is because matplotlib sets the z limits of the plots automatically, each time you plot something. In our case, it looks much smoother to just start the z axis at (-1,1) instead of watching it stretch throughout the first oscillation.

#!/usr/bin/env python
from __future__ import division
"""
Based on the wireframe example script
"""
from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
from matplotlib import cm
from numpy import sin, cos, arctan, arctan2, array, sqrt, linspace, meshgrid
import scipy
import scipy.special
from scipy.special import jn, jn_zeros
import time, sys, os

v = 1
n, m = 1, 1
if len(sys.argv) >= 3:
    n,m = [int(i) for i in sys.argv[1:3]]
if m == 0:
    sys.exit("M must be greater than or equal to 1")

if len(sys.argv) >= 4:
    fns = sys.argv[3]
else:
    fns = 'cc'
    
f1 = {'s': sin, 'c': cos}[fns[0]]
f2 = {'s': sin, 'c': cos}[fns[1]]

def k(m,n):
    return jn_zeros(n,m)[m-1] # m is 1-indexed here

def generate(X, Y, t):
    theta = arctan2(Y,X) # This does arctan(Y/X) but gets the sign right.
    R = sqrt(X**2 + Y**2)
    # We know z = J_n(k*r)*cos(n*theta)*cos(k*v*t)
    # 
    result = jn(n,k(m,n)*R)*f1(n*theta)*f2(k(m,n)*v*t)
    result[R>1] = 0 # we plot points from the square, but physically require this.
    return result

plt.ion()
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

xs = linspace(-1, 1, 100)
ys = linspace(-1, 1, 100)
X, Y = meshgrid(xs, ys)
Z = generate(X, Y, 0.0)

wframe = None
tstart = time.time()
for t in linspace(0, 10, 400):
    oldcol = wframe
    Z = generate(X, Y, t)
    #wframe = ax.plot_wireframe(X, Y, Z, rstride=2, cstride=2)
    wframe = ax.plot_surface(X, Y, Z, rstride=4, cstride=4, alpha=0.3)

    # Remove old line collection before drawing
    if oldcol is not None:
        ax.collections.remove(oldcol)
    if n == 0:
        ax.set_zlim(-1,1)
    else:
        ax.set_zlim(-0.5,0.5)
    plt.draw()

The individual frames look something like

A frame from k11

I also made a beefier version (you can find it in my github repo) that plots several of the modes in a grid. The different columns are m=1,2,3 and rows are n=1,2. The script lets you plot as many or as few of these as you’d like. It shows a 2D version next to the 3D version, and the nodelines are shown in black on the 2D version.

I also followed The Glowing Python’s idea of writing out a bunch of png files for each frame and making a movie with ffmpeg:

ffmpeg -q:a 5 -r 5 -b:v 19200 -i img%04d.png movie.mp4

Here’s the movie:

and you can download a slightly higher quality version here. A slightly higher quality version of the movie at the top of this page (2 values each of m and n) is here.

[*] Among other things:

  • Python is free, which is important at a small school
  • I want this to show up in a computational modeling class, and I want that class to draw in CS, biology, chemistry, biochemistry, math, econ and geology students. I think Python is an easier cross-department sell
  • The IPython Notebook is fantastic these days

March Madness, Monte Carlo Style!

I’m teaching Thermal Physics this term, so obviously I rearranged the syllabus so that we could all run Monte Carlo simulations for March Madness!

Quick summary for the class is at the end

The basics

We started off writing an energy function and figuring out how to play a simple 8-team bracket (side-note: IPython notebooks were great for this purpose! We were able to record our class sessions and post them online, along with a “Scientific Cookbook”). This version of Thermal Physics doesn’t have a programming prerequisite, so the coding parts of the assignments stopped with running and analyzing 8-team brackets.

We then spent a chunk of class time comparing our simulations with the 2-state paramagnets we’d seen in Schroeder’s text (we covered section 8.2 a bit out of sequence). In that context, it was clear what a Monte Carlo move was. We convinced ourselves that we could do something similar in bracket space, but that we couldn’t just flip a single game (we also had to consider flipping the games that depended on that game). The coding for that part was definitely beyond the scope of this class, but I hacked up an ugly version for us to use.

Warnings to the programmers in audience: this code is very much not pretty and very much not optimized. In particular, the Bracket code is really slow, to the point where it takes a noticeable amount of time to run 10000 brackets. The main goals were (1) get something written (2) make it easy for the students to modify the parts they need to modify. Please feel free to send comments! If I get more than 1, I’ll consider stuffing things on github, etc.

So, to run things, download all of the Python files Brackets.py JeffSagarin.py KenPomeroy.py MonteCarloBrackets.py RankingsAndStrength.py Stats.py Visualization.py decorators.py from This directory on my website.

EDIT: See the comments. Daniel Smith has made some improvements and started a github repo, which I’ve forked.

wget http://mglerner.com/MarchMadness/Brackets.py
wget http://mglerner.com/MarchMadness/JeffSagarin.py
wget http://mglerner.com/MarchMadness/KenPomeroy.py
wget http://mglerner.com/MarchMadness/MonteCarloBrackets.py
wget http://mglerner.com/MarchMadness/RankingsAndStrength.py
wget http://mglerner.com/MarchMadness/Stats.py
wget http://mglerner.com/MarchMadness/Visualization.py
wget http://mglerner.com/MarchMadness/decorators.py

You then need to have matplotlib installed. I highly recommend the Enthought Python Distribution.

From the IPython prompt (making sure you’re in the directory that has all of the downloaded files), you can look at a single bracket like this:

run MonteCarloBrackets.py
teams = RAS.teams['south']
b = Bracket(teams=teams,T=0.5)
print b

which will print out something like

Kansas (1)                                                   
Western Kentucky (16)     Kan (1)                            
North Carolina (8)                                           
Villanova (9)             Vil (9)  Vil (9)                   
Virginia Commonwealth (5)                                    
Akron (12)                Vir (5)                            
Michigan (4)                                                 
South Dakota St. (13)     Mic (4)  Vir (5)  Vil (9)          
UCLA (6)                  UCL (6)  Flo (3)  Geo (2)  Geo (2) 
Minnesota (11)                                               
Florida (3)               Flo (3)                            
Northwestern St. (14)                                        
San Diego St. (7)         Okl (10) Geo (2)                   
Oklahoma (10)                                                
Georgetown (2)            Geo (2)                            
Florida Gulf Coast (15)                                      
Total bracket energy: -17.7885496023

By default, the energy is calculated by the function

import RankingsAndStrength as RAS
strength = RAS.kenpom['Pyth'] # use Ken Pomeroy's "Pyth"
def energy_game(winner,loser):
    result = -(strength[winner]/strength[loser])
    return result

and you can define your own energy function fairly easily, including the coin-flipping favorite

def energy_game(winner,loser):
    result = random()
    return result

While we’re talking about rankings, one quick note: one of the class-favorite energy functions was based on the team rankings. The official tournament rankings are only defined in groups of 16 (i.e. the South, the East, the Midwest and the West). So, in order to compare them across regions, I add a bit of randomness in the middle of RankingsAndStrength.py:

for region in teams:
    for (team,rank) in zip(teams[region],_rankings):
        regional_rankings[team] = rank + random()/10

So, if you just want one bracket based on your energy function, you can run the above code, but with teams=”all”. You are now equipped to beat all of your friends in your bracket pool!

But wait! There’s more!

We went to the trouble of figuring out how to make moves in bracket space. So, we have a simulate(ntrials,region,T) function that starts at a random (but valid) point in bracket space and makes ntrials moves, using the Metropolis algorithm with your energy function and temperature T to accept or reject moves properly. There are a few more details including plotting the results, but here’s the important part of simulate:

def simulate(ntrials,region,T):
    teams = RAS.teams[region]
    b = Bracket(teams,T)
    ng = sum(b.games_in_rounds) # total number of games
    # Let’s collect some statistics
    brackets = []
    for trial in xrange(ntrials):
        g = choice(range(ng)) # choose a random game to swap
        #print "attempted swap for game",g #,"in round",round[g]
        newbracket = deepcopy(b)
        newbracket.swap(g)
        ediff = newbracket.energy() - b.energy()
        if ediff <= 0:
            b = newbracket
        else:
            if random() < exp(-ediff/T):
                b = newbracket
        brackets.append(b)
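The accept/reject block in simulate is the standard Metropolis rule. As a self-contained sanity check (a toy two-state system, not the bracket code), the same rule should reproduce the Boltzmann ratio exp(-ΔE/T) between the occupancies of the two states:

```python
import math
import random

random.seed(1)

E = {0: 0.0, 1: 1.0}  # two states separated by an energy gap of 1
T = 0.5

state = 0
counts = {0: 0, 1: 0}
for _ in range(200000):
    new = 1 - state  # propose hopping to the other state
    ediff = E[new] - E[state]
    # Metropolis: always accept downhill; accept uphill with prob exp(-ediff/T)
    if ediff <= 0 or random.random() < math.exp(-ediff / T):
        state = new
    counts[state] += 1

ratio = counts[1] / counts[0]
print(ratio)  # should land near exp(-1/0.5), about 0.135
```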

We keep track of each bracket in the brackets list, and then use it to print out some statistics. So, what does that look like?

simulate(100,'south',0.5)

which plots the following image
Teams from the South, 100 runs

Here, I’ve plotted at least a few useful things. Looking on the left side you can verify whether or not we’re seeing a Boltzmann distribution. You know that bracket space is really big (quiz: how big?). So, one way to assess our sampling is to keep track of the number of unique brackets we’ve seen. That’s on the top right. If we’ve fully sampled bracket space, you’d expect that to approach a horizontal asymptote. You can actually see that behavior if you run a small bracket with lots of trials (pro tip: one of the things I haven’t mentioned yet is that you can specify a list of teams like ["Kansas","Kentucky","Indiana","Duke"] as the second argument to simulate, instead of a region).

simulate also prints out both the lowest energy bracket possible with your energy function and the most probable bracket according to your Monte Carlo sampling (the one below showed up 6 times)

Most common bracket (6)
Kansas (1)                                                   
Western Kentucky (16)     Kan (1)                            
North Carolina (8)                                           
Villanova (9)             Vil (9)  Vil (9)                   
Virginia Commonwealth (5)                                    
Akron (12)                Vir (5)                            
Michigan (4)                                                 
South Dakota St. (13)     Mic (4)  Mic (4)  Vil (9)          
UCLA (6)                  UCL (6)  Flo (3)  Flo (3)  Flo (3) 
Minnesota (11)                                               
Florida (3)               Flo (3)                            
Northwestern St. (14)                                        
San Diego St. (7)         Okl (10) Flo (15)                  
Oklahoma (10)                                                
Georgetown (2)            Flo (15)                           
Florida Gulf Coast (15)                                      
Total bracket energy: -17.2487840302

Lowest energy bracket
Kansas (1)                                                   
Western Kentucky (16)     Kan (1)                            
North Carolina (8)                                           
Villanova (9)             Nor (8)  Kan (1)                   
Virginia Commonwealth (5)                                    
Akron (12)                Vir (5)                            
Michigan (4)                                                 
South Dakota St. (13)     Mic (4)  Mic (4)  Kan (1)          
UCLA (6)                  Min (11) Flo (3)  Flo (3)  Flo (3) 
Minnesota (11)                                               
Florida (3)               Flo (3)                            
Northwestern St. (14)                                        
San Diego St. (7)         San (7)  Geo (2)                   
Oklahoma (10)                                                
Georgetown (2)            Geo (2)                            
Florida Gulf Coast (15)                                      
Total bracket energy: -18.3666003991

You can, of course, run more simulations

simulate(10000,'south',0.5)

Teams from the South, 10000 runs

Most common bracket (16)
Kansas (1)                                                   
Western Kentucky (16)     Kan (1)                            
North Carolina (8)                                           
Villanova (9)             Nor (8)  Nor (8)                   
Virginia Commonwealth (5)                                    
Akron (12)                Vir (5)                            
Michigan (4)                                                 
South Dakota St. (13)     Sou (13) Vir (5)  Nor (8)          
UCLA (6)                  Min (11) Flo (3)  Flo (3)  Flo (3) 
Minnesota (11)                                               
Florida (3)               Flo (3)                            
Northwestern St. (14)                                        
San Diego St. (7)         Okl (10) Okl (10)                  
Oklahoma (10)                                                
Georgetown (2)            Geo (2)                            
Florida Gulf Coast (15)                                      
Total bracket energy: -17.716285007

Lowest energy bracket
Kansas (1)                                                   
Western Kentucky (16)     Kan (1)                            
North Carolina (8)                                           
Villanova (9)             Nor (8)  Kan (1)                   
Virginia Commonwealth (5)                                    
Akron (12)                Vir (5)                            
Michigan (4)                                                 
South Dakota St. (13)     Mic (4)  Mic (4)  Kan (1)          
UCLA (6)                  Min (11) Flo (3)  Flo (3)  Flo (3) 
Minnesota (11)                                               
Florida (3)               Flo (3)                            
Northwestern St. (14)                                        
San Diego St. (7)         San (7)  Geo (2)                   
Oklahoma (10)                                                
Georgetown (2)            Geo (2)                            
Florida Gulf Coast (15)                                      
Total bracket energy: -18.3666003991

So, now maybe we’re ready to run a full bracket. I can think of at least two reasonable ways to do that. The first is just to run it:

simulate(10000,'all',0.5)

All teams, 10000 runs

On the other hand, as you can see, that’s only sampling a tiny bit of bracket space. When I run it, the most common bracket is only seen 8 times out of the 10000, for instance.

Another strategy would be to simulate each of the regions a bunch of times (you can sit around and wait for 100,000 trials on your own machine, etc.). Then you can run your own Final Four using a secret feature: you can specify a list of teams instead of a region:

simulate(1000,['Louisville','Gonzaga','Kansas','Indiana'],0.5)

Final 4, 1000 runs

Most common bracket (148)
Louisville (1)                             
Gonzaga (1)               Lou (1)          
Kansas (1)                Kan (1)  Lou (1) 
Indiana (1)                                
Total bracket energy: -3.02255561688

Lowest energy bracket
Louisville (1)                             
Gonzaga (1)               Lou (1)          
Kansas (1)                Ind (1)  Lou (1) 
Indiana (1)                                
Total bracket energy: -3.04693046285

You’d then piece together a complete bracket from your results (don’t forget to keep track of the regional brackets!).

These are, conveniently, already coded up for you via

runbracket1(ntrials=10000,T=1.0)

and

runbracket2(ntrials1=10000,ntrials2=1000,T=1.0)

What if you just want to place bets?

We learned something from Nate Silver’s political simulations. Most bets involve things like the probability that team X makes the final four. If you want to do that, you could do the same thing Silver does: run a ton of independent simulations, keep the results in a list, and then calculate statistics over that list. You can feel free to do so for a significant chunk of extra credit.
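Here's a minimal sketch of that approach on a toy four-team field with pure coin-flip games (play_bracket is my own helper, not part of the class code): run many independent brackets and tabulate how often each team wins it all.

```python
import random
from collections import Counter

random.seed(42)
teams = ["Louisville", "Gonzaga", "Kansas", "Indiana"]

def play_bracket(field):
    """Play one single-elimination bracket where every game is a coin flip."""
    while len(field) > 1:
        field = [random.choice(field[i:i + 2]) for i in range(0, len(field), 2)]
    return field[0]

wins = Counter(play_bracket(teams) for _ in range(10000))

# With fair coins, every team should win close to 25% of the time.
for team in teams:
    print(team, wins[team] / 10000.0)
```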

QUICK SUMMARY FOR THE CLASS:

- Download all of the python files from the website

- Change the energy function in MonteCarloBrackets.py to do what you want. For instance, you can use any combination of the strengths from Ken Pomeroy by using RAS.kenpom['Pyth'], etc. I’ve also added Jeff Sagarin’s ratings (RAS.sagarin['Rating']). Remember that you should all have a “silly” function (e.g. one of you has a dictionary that keeps track of how awesome a team’s mascot is). You can combine silly with non-silly as well. You get up to 10 brackets total.

- Generate brackets with either of the methods mentioned above (running with ‘all’ or running the regions individually and then running your own final 4). It’s up to you whether you use the lowest-energy bracket or the most common bracket, but you should justify your result.

- Send me the results! You can then go ahead and enter up to 10 brackets in our bracket pool. If you need to, you can have me enter them for you (this is most likely to happen if you don’t finish this until near the end of/after spring break). The winner will get something great (likely a home-baked treat!).
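For reference, an energy function along these lines might look like the sketch below. The `pyth` and `mascot_awesomeness` dictionaries are toy stand-ins for `RAS.kenpom['Pyth']` and a student's silly metric; only the idea of mixing a real rating with a silly one comes from the summary above.

```python
# Toy stand-ins for RAS.kenpom['Pyth'] ratings and a student's "silly" metric.
pyth = {"Louisville": 0.95, "Gonzaga": 0.93, "Kansas": 0.92, "Indiana": 0.94}
mascot_awesomeness = {"Louisville": 3, "Gonzaga": 8, "Kansas": 5, "Indiana": 2}

def energy_game(winner, loser):
    """Lower energy = more plausible game outcome."""
    rating_term = -(pyth[winner] - pyth[loser])          # favor stronger teams
    silly_term = -0.01 * (mascot_awesomeness[winner]
                          - mascot_awesomeness[loser])   # small mascot bonus
    return rating_term + silly_term
```

Keep the silly term's weight small relative to the rating term, or the mascot dictionary will dominate your brackets.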

Comments from previous post

14 Responses to March Madness, Monte Carlo Style!

  1. Daniel says:

    Here is the code in module form on GitHub:

    https://github.com/Daniel-B-Smith/MarchMadnessMonteCarlo

    I did a few mini-optimizations, but the thing that is killing you in simulate() is using deepcopy() on your bracket object at every step. Refactoring so that you can call a shallow copy should speed things up considerably.

  2. mglerner says:

    Thanks, Daniel! I've forked this as https://github.com/mglerner/MarchMadnessMonteCarlo ... which is fairly fun, since I haven't used github before.

  3. Wayne says:

    Seems the last line for instructions on how to execute runbracket2 should read
    runbracket2(ntrials1=10000,ntrials2=1000,T=1.0)?

  4. Jessen says:

    Choosing the most common bracket optimizes your probability of picking a perfect bracket (which Warren Buffett would have given you $1B for last year - so that ain't nothing), but it doesn't necessarily optimize the chance of winning your office/class pool.

    One strategy is to find the bracket that has the highest expected score (many pools give more points for correctly predicting later rounds, as well as spoiler points for picking upsets) over the ensemble of brackets you've generated. Luckily, since the probability of making it to any round is conditionally independent of the opponent that you face there, it's sufficient to summarize the probability of making it to any round (which is exactly the table that fivethirtyeight.com publishes) to calculate an expected score. All that remains is to pick a bracket to optimize your expected score. Since you've already got some MC code written, you're most of the way to writing a simulated annealing optimizer for your pick!

    (Spoilers: I already did this... I'll tell you how terrible it is later.)
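Jessen's expected-score calculation can be sketched as follows. The data layout is an assumption: `round_probs[team][r]` is the probability the team reaches round `r` (the kind of table fivethirtyeight publishes), `bracket[r]` is the list of teams you picked to reach round `r`, and `points_per_round[r]` is the point value of a correct pick in that round.

```python
def expected_score(bracket, round_probs, points_per_round):
    """Expected pool score: each pick contributes its point value weighted
    by the probability that the pick actually reaches that round."""
    total = 0.0
    for rnd, picks in enumerate(bracket):
        for team in picks:
            total += points_per_round[rnd] * round_probs[team][rnd]
    return total
```

The conditional-independence point is what makes this linear: because reaching round `r` doesn't depend on which opponent shows up there, the expectation decomposes into a simple sum over picks.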

    • mglerner says:

      Jessen, that's awesome. My update for this year (http://nbviewer.ipython.org/github/mglerner/MarchMadnessMonteCarlo/blob/master/MMMC2015.ipynb is the same as http://www.mglerner.com/blog/?p=49) does make the same table as fivethirtheight.com.

      Good call about expected score.

      Also good call about simulated annealing. Have you done any looking into what bracket space actually looks like geometrically?

      Do let me know how your bracket turns out!

    • Daniel says:

      Optimizing for expected score is a strategy that will give you the highest score on average, but it won't give you the best odds of winning your pool. To win your pool, you need to play your opponents, not the bracket. In a non-tiny pool, you basically have to pick the national champion correctly to win the pool. If you make the most popular pick, your odds of winning are still fairly low since a bunch of other people made the same pick. When picking your champion, you need to consider both the probability that that team wins the championship and the expected number of people in your pool picking that team.

      Slate publishes a story about this every year:

      http://www.slate.com/articles/sports/sports_nut/2015/03/ncaa_bracket_picks_2015_why_you_should_bet_against_kentucky_and_pick_arizona.html

      • Jessen says:

        Yup, you're right that optimizing expected score is not the optimal strategy, but I'd guess that for your typical office pool, it is a probably approximately correct (a real term, though I'm abusing it here) strategy. The strategies that the Slate article goes into don't start applying until the pool gets large enough (roughly when the probability that one of the other N-1 entrants picks a "lucky" mainstream bracket grows larger than the probability that a black swan bracket wins).

        Out of expediency I went with expected score calculations, but here's a potentially better strategy:
        1) Generate an ensemble of potential outcomes
        2) Generate an ensemble of potential competitive brackets (good source data here: https://tournament.fantasysports.yahoo.com/t1/pickdistribution)
        3) Estimate your bracket's probability of winning using those two ensembles (this will depend on the size of the pool)
        4) Optimize!

        For small N, I think this looks more like score maximization. For large N, I believe this is approximately equivalent to the "hedge fund" strategy of picking the black swans. I don't know at what N the crossover happens, but I'd guess it's closer to 100 than to 20. One other thought: I think you can get an additional edge from the observation that people's picks from round to round are not independent (witness mglerner's unhealthy obsession with Kansas 🙂 ). Not sure how to model that, but I'd guess I'm getting into third- or fourth-order effects at this point.
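Steps 1-4 can be sketched as a nested Monte Carlo estimate. Everything here is hypothetical scaffolding (the bracket representation, the `score` callback, ties counted as losses); it only illustrates the shape of the calculation.

```python
import random

def pool_win_probability(my_bracket, outcomes, opponent_pool, score, pool_size,
                         ntrials=1000, seed=0):
    """Estimate P(win the pool): for each sampled tournament outcome, draw
    pool_size opponents from the competitive-bracket ensemble and check
    whether we strictly outscore all of them (ties count as losses)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(ntrials):
        outcome = rng.choice(outcomes)                     # step 1's ensemble
        mine = score(my_bracket, outcome)
        rivals = [score(rng.choice(opponent_pool), outcome)  # step 2's ensemble
                  for _ in range(pool_size)]
        if all(mine > r for r in rivals):
            wins += 1
    return wins / ntrials
```

Step 4 would then wrap this in an optimizer (e.g. the simulated annealing mentioned earlier) over candidate brackets.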

        Now, to empirical results:
        I've entered my expected-points-maximizing bracket into two pools with different scoring schemes (one a standard scoring scheme with upset points, and the other is too convoluted and byzantine to explain here). Kentucky was picked to win in 5 out of 14 and 7 out of 18 of the brackets, respectively (including both of my brackets), so I feel like I'm still in the realm where maximizing points matters more than worrying about monkeys at typewriters. I'm sitting at 3rd place in both pools, though within spitting distance of 1st in both. Unfortunately, my two brackets have Villanova and Virginia in the finals, so things are not looking so bright.

        However, I'm still doing better than last year, where my strategy was more akin to "do I know someone who went there?"

        • Daniel says:

          I think Kentucky this year is a great example of where the points maximizing strategy breaks down even for small pools. From the small pools I've seen, >50% of brackets pick Kentucky to win in spite of them having an estimated ~35% chance of winning. Even in a small pool, you are much more likely to win having picked a team like Arizona because your odds of winning picking Kentucky even conditional on Kentucky winning are pretty low.
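Daniel's point reduces to a back-of-the-envelope calculation: split a champion's win probability among everyone in the pool expected to make the same pick. The Kentucky numbers (~35% to win, >50% picked) are from the comment above; the 15%-to-win, 10%-picked contrarian team is my assumption, and the formula ignores earlier-round tiebreakers entirely.

```python
def champ_pick_value(p_champ, pick_share, pool_size):
    """Rough value of a champion pick: P(team wins) divided by the expected
    number of entrants making the same pick (including you). A first-order
    estimate only; it ignores all earlier-round scoring."""
    co_pickers = 1 + pick_share * (pool_size - 1)
    return p_champ / co_pickers

kentucky = champ_pick_value(0.35, 0.50, pool_size=20)   # chalk pick
contrarian = champ_pick_value(0.15, 0.10, pool_size=20)  # Arizona-style pick
```

Even with a 20-person pool, the contrarian pick comes out ahead of the chalk pick here, which is exactly the Slate article's argument.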

          • mglerner says:

            It's funny that this is a bit different from poker, where many people vastly overestimate the need for exploitative play.

          • Jessen says:

            Yeah, I think you're right. Basically, I'm counting on the triage opportunities in the earlier rounds to push my brackets over the top of the others. If there isn't enough inefficiency in other people's picks, it's better to bet on the undervalued champion. Wish I had time to do a more thorough analysis.


Getting a sensible Python install on my Mac

I recently had to get Python + things I like up and running on my Mac from scratch. That’s a significantly harder task than you might guess, so one of the main reasons I’m posting this is for my own benefit (I’ll have to do this all again in February when I give my current laptop back to NIH and buy one of my own). There’s a decent chance this will be useful for others, though. A significant amount of experience and gnashing of teeth has told me

1) ActiveState is awesome, but not everything works perfectly with it.
2) Ditto for Python 2.7.
3) Never mix macports and fink.

So, here’s my setup for a working python2.6 system that happens to meet my current needs:

0) Fully uninstall macports (http://guide.macports.org/#installing.macports.uninstalling). Follow the full set of directions.

1) Install the latest and greatest macports, using the package install and the easy options (http://guide.macports.org/#installing.macports).

2) Make variants.conf reject all python2.7-related things
[lernerm@LCBB-ML01777274 ~]$ cat /opt/local/etc/macports/variants.conf
# To specify global variants to use for all port builds,
# customize this file to list variant settings you want.
#
# Any variants specified here that are not supported by
# a port will just be ignored. Multiple variants can be
# specified per line, or one per line is also allowed.
#
# Example:
# +ipv6 +no_x11
+py26 +python26 -py27 -python27 +bash_completion

3) Make sure you can install some basic things. I start with bash completion (https://trac.macports.org/wiki/howto/bash-completion) and coreutils.


sudo port install bash-completion
sudo port install coreutils +with_default_names

Note that you’ll have to edit your .bash_profile to make bash-completion work. It’s documented in the link above. You may also have to add /opt/local/libexec/gnubin to your path to get coreutils to work.

4) Install the Python environment I want, including
- python 2.6
- matplotlib
- numpy
- scipy
- ipython
- boost
- cgal
- cmake
- openbabel


sudo port install python26
sudo port select --set python python26
sudo port install py26-numpy py26-scipy py26-matplotlib py26-ipython py26-openbabel cmake boost cgal py26-wxpython py26-cython
sudo port select --set ipython ipython26

Change the noninteractive matplotlib default backend:


[lernerm@LCBB-ML01777274 ~]$ cat ~/.matplotlib/matplotlibrc
backend : WXAgg
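At this point it's worth a quick sanity check that the ports actually wired up the Python stack. A minimal sketch (the module list is just the packages installed above; `__import__` is used so it works on python2.6 as well):

```python
def check_stack(modules=("numpy", "scipy", "matplotlib", "IPython")):
    """Try to import each expected package; return (ok, missing) lists so a
    broken port shows up immediately instead of mid-analysis."""
    ok, missing = [], []
    for name in modules:
        try:
            __import__(name)
            ok.append(name)
        except ImportError:
            missing.append(name)
    return ok, missing

ok, missing = check_stack()
print("imported:", ok)
print("missing:", missing)
```

Anything in the `missing` list means the corresponding port didn't install cleanly, or your `port select` step didn't take.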

5) Mercurial. I can’t use the macports package, because it forces a python2.7 install! The download from http://mercurial.selenic.com/downloads/ is fine, but I’d really rather have it as a port for consistency/upgrade reasons.

6) Then install additional packages I want

6a) numpy sharedmem (https://bitbucket.org/cleemesser/numpy-sharedmem/overview)


hg clone https://bitbucket.org/cleemesser/numpy-sharedmem
cd numpy-sharedmem
python setup.py build
python setup.py install --prefix=~/software # Note: that’s in my path already

6b) dionysus (http://www.mrzv.org/software/dionysus/get-build-install.html)


hg clone http://hg.mrzv.org/Dionysus/
cd Dionysus
hg up tip

mkdir build
cd build
cmake ..

add the following to CMakeLists.txt:


set(CMAKE_LIBRARY_PATH /opt/local/lib ${CMAKE_LIBRARY_PATH} )
set(CGAL_DIR /opt/local/lib/cmake)

That’s not enough. Edit CMakeCache.txt by hand to point at the right Python by searching for /System/Library/Frameworks/Python.framework and replacing it with /opt/local/Library/Frameworks/Python.framework and then, finally,


make

And then make sure that $HOME/src/Dionysus/build/bindings/python is in your PYTHONPATH.

6c) orange (from source http://orange.biolab.si/nightly_builds.html)


python setup.py build
python setup.py install --user

Note that the last bit requires you to have this in your .bashrc or .bash_profile (or both, or maybe they’re the same file because you’ve symlinked them!):


export PYTHONPATH=$HOME/.local/lib/python2.6/site-packages:$PYTHONPATH

Whew!

Comments from previous blog

  • From mglerner: I suppose numpy sharedmem could be done with python setup.py install --user. I tend to like --prefix=~/software, but that mysteriously didn't work for orange, so I'll probably switch everything over to --user for consistency.
  • From Anthony: But does Dionysus work? Finally?
  • From mglerner: Yes! Finally! I had this post 3/4 written for quite a while before I filled in the missing dionysus pieces.
  • From Anthony: So now that you're moving to python 2.7 you should update this. Do you have everything working yet?
  • From Ernest Yeung:
      Prof. Lerner, I was trying to get Dionysus working on my MacBook Air 13in, 2011, Mac OS X 10.8; cloning it with Mercurial went without a hitch. Cmake and Boost were installed as required by mrzv.org (the author of Dionysus).
    
    I'm having problems at the build stage. By the way, what good directories should I put all this stuff in and should I be in administrator or a regular account to do all this?
    cd build % Yep, in Dionysus dir
    cmake ..
    
    That's when I get problems:
    CMake Error at /Applications/CMake 2.8-11.app/Contents/share/cmake-2.8/Modules/FindBoost.cmake:1106 (message):
    Unable to find the requested Boost libraries.
    
    Unable to find the Boost header files. Please set BOOST_ROOT to the root
    directory containing Boost or BOOST_INCLUDEDIR to the directory containing
    Boost's headers.
    Call Stack (most recent call first):
    CMakeLists.txt:13 (find_package)
    
    -- Could NOT find Doxygen (missing: DOXYGEN_EXECUTABLE)
    -- CGAL not found, therefore alphashapes will not be built.
    CMake Warning (dev) at examples/homology-zigzags/CMakeLists.txt:9 (add_executable):
    Policy CMP0002 is not set: Logical target names must be globally unique.
    Run "cmake --help-policy CMP0002" for policy details. Use the cmake_policy
    command to set the policy and suppress this warning.
    This warning is for project developers. Use -Wno-dev to suppress it.
    
    -- CGAL not found, alphashape bindings will not be built
    CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
    
    I installed boost onto the previous directory from my Dionysus. I installed it opening up the .tar.bz2 downloaded from the boost website. In your experience, does this work well?
    
    I'm not familiar with the problems with CGAL. Do you have any knowledge of them?
    
    Maybe another way I should ask about all this is: if all I want is a very minimal installation (just Dionysus, with Mercurial, and of course what the author requires, CMake and Boost), how should I go about getting it?
    
    Many thanks, I'll write up what I learned on my blog and it may help others as well on a Mac.
    
    I share your passion for python.
    
    Facebook : ernestyalumni
    gmail : ernestyalumni
    linkedin : ernestyalumni
    tumblr : ernestyalumni
    twitter : ernestyalumni
    weibo 微博 : ernestyalumni
    youku : ernestyalumni
    youtube : ernestyalumni
    indiegogo : ernestyalumni